A statistical approach to machine translation
Computational Linguistics
Computational Linguistics - Special issue on web as corpus
Automatic construction of parallel English-Chinese corpus for cross-language information retrieval
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
QRselect: a user-driven system for collecting translation document pairs from the web
ICADL'07 Proceedings of the 10th international conference on Asian digital libraries: looking back 10 years and forging new frontiers
Hi-index | 0.00 |
Parallel corpus is a valuable resource used in various fields of multilingual natural language processing. One of the most significant problems in using parallel corpora is the lack of their availability. Researchers have investigated approaches to collecting parallel texts from the Web. A basic component of these approaches is an algorithm that judges whether a pair of texts is parallel or not. In this paper, we propose an algorithm that accelerates this task without losing accuracy by preprocessing a bilingual dictionary as well as the collection of texts. This method achieved 250,000 pairs/sec throughput on a single CPU, with the best F1 score of 0.960 for the task of detecting 200 Japanese-English translation pairs out of 40,000. The method is applicable to texts of any format, and not specific to HTML documents labeled with URLs. We report details of these preprocessing methods and the fast comparison algorithm. To the best of our knowledge, this is the first reported experiment of extracting Japanese-English parallel texts from a large corpora based solely on linguistic content.