This paper describes an intelligent agent that facilitates bitext mining from the Web by automatically discovering URL pairing patterns (or keys) for retrieving parallel web pages. The linking power of a key, defined as the number of URL pairs it can match, serves as the objective function in the search for the set of keys that finds the greatest number of web page pairs within a bilingual website. The approach requires no prior knowledge such as ad hoc heuristics, no labelled training data, and no similarity analysis of web page structure or content, all of which are commonly involved in existing approaches. Our experiments show that a best-first search that approximates this optimization with an empirical threshold recognizes 98.1% of true parallel web pages and discovers many irregular pairing patterns that other approaches are unlikely to find.
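To make the idea of a key and its linking power concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a key is the pair of substrings left over after stripping the longest common prefix and suffix from two URLs, and that the search greedily accepts the strongest key until its linking power falls below a threshold. The function names (`candidate_key`, `linking_power`, `mine_pairs`), the default threshold, and the sample URLs are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from os.path import commonprefix


def candidate_key(u, v):
    """Return the differing substrings (a, b) of two distinct URLs.

    Assumption: a key is what remains after removing the longest
    common prefix and the longest common suffix.
    """
    p = len(commonprefix([u, v]))
    ru, rv = u[p:], v[p:]
    s = len(commonprefix([ru[::-1], rv[::-1]]))
    return ru[:len(ru) - s], rv[:len(rv) - s]


def linking_power(urls):
    """Map every candidate key to the URL pairs it matches.

    The linking power of a key is the length of its pair list.
    Quadratic in the number of URLs, which is fine for a sketch.
    """
    matches = defaultdict(list)
    for u, v in combinations(sorted(urls), 2):
        matches[candidate_key(u, v)].append((u, v))
    return matches


def mine_pairs(urls, threshold=2):
    """Greedy best-first selection of keys by linking power.

    The threshold is an illustrative stand-in for the empirical
    threshold mentioned in the abstract.
    """
    remaining = set(urls)
    found = []
    while True:
        matches = linking_power(remaining)
        if not matches:
            break
        # Strongest key first; ties broken by the key itself.
        key, pairs = max(matches.items(), key=lambda kv: (len(kv[1]), kv[0]))
        if len(pairs) < threshold:
            break
        used, accepted = set(), []
        for u, v in pairs:
            # Each URL may take part in at most one accepted pair.
            if u not in used and v not in used:
                accepted.append((u, v))
                used.update((u, v))
        found.append((key, accepted))
        remaining -= used
    return found


site = [
    "/en/index.html", "/zh/index.html",
    "/en/about.html", "/zh/about.html",
    "/en/news.html", "/zh/news.html",
    "/contact.html",
]
result = mine_pairs(site, threshold=2)
```

On this toy site the key `("en", "zh")` matches three URL pairs, while accidental keys such as `("about", "index")` match only two, so the greedy step selects the language-directory pattern first and leaves `/contact.html` unpaired.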