A novel method for bilingual web page acquisition from search engine web records

Authors:
Yanhui Feng;Yu Hong;Zhenxiang Yan;Jianmin Yao;Qiaoming Zhu
Affiliations:
Soochow University;Soochow University;Soochow University;Soochow University;Soochow University
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Year:
2010


Abstract

We propose a new approach to acquiring bilingual web pages from the result pages of search engines. The approach comprises two challenging tasks. The first is to automatically detect the web records embedded in the result pages by applying a clustering method to a sample page. The records identified this way yield highly effective features for the second task, high-quality bilingual web page acquisition, which we treat as a classification problem. One advantage of the approach is that it is independent of both search engine and domain. We evaluate it on 2516 records that were extracted automatically from six search engines and annotated manually, obtaining a precision of 81.3% and a recall of 94.93%. The experimental results indicate that the approach is effective.
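
For readers who want to relate the two reported figures, the following is a minimal sketch of how precision and recall are conventionally computed for a binary classifier, and how they combine into an F1 score. The function names and the example counts are hypothetical and chosen only for illustration; the abstract does not report an F1 value or the underlying confusion counts, so the combined figure computed below is derived from the stated precision and recall rather than taken from the paper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts of a binary classifier.

    tp: records correctly labeled as bilingual
    fp: records wrongly labeled as bilingual
    fn: bilingual records the classifier missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, used only to exercise the helper.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f}")

# Precision and recall as stated in the abstract; the F1 value is derived
# here and is NOT a figure reported by the authors.
derived_f1 = f1_score(0.813, 0.9493)
print(f"derived F1={derived_f1:.4f}")
```

With recall well above precision, the classifier described in the abstract misses few bilingual records but admits some false positives, which is a common trade-off when the goal is to gather as much candidate material as possible.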