Machine Learning
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
Query-biased web page summarisation: a task-oriented evaluation
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
A brief survey of web data extraction tools
ACM SIGMOD Record
Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Computational Linguistics - Special issue on web as corpus
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Automatic extraction of dynamic record sections from search engine result pages
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A DOM tree alignment model for mining parallel data from the web
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Extracting data records from the web using tag path clustering
Proceedings of the 18th international conference on World wide web
A novel discourse parser based on support vector machine classification
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Mining bilingual data from the web with adaptively learnt patterns
ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2
Extracting web data using instance-based learning
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
NET – a system for extracting web data from flat and nested data records
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Automatic acquisition of chinese–english parallel corpus from the web
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval
Hi-index | 0.01 |
A new approach has been developed for acquiring bilingual web pages from the result pages of search engines, which is composed of two challenging tasks. The first task is to detect web records embedded in the result pages automatically via a clustering method of a sample page. Identifying these useful records through the clustering method allows the generation of highly effective features for the next task which is high-quality bilingual web page acquisition. The task of high-quality bilingual web page acquisition is a classification problem. One advantage of our approach is that it is search engine and domain independent. The test is based on 2516 records extracted from six search engines automatically and annotated manually, which gets a high precision of 81.3% and a recall of 94.93%. The experimental results indicate that our approach is very effective.