A fast and accurate method for detecting English-Japanese parallel texts

Authors:
Ken'ichi Fukushima;Kenjiro Taura;Takashi Chikayama
Affiliations:
University of Tokyo;University of Tokyo;University of Tokyo
Venue:
MLRI '06 Proceedings of the Workshop on Multilingual Language Resources and Interoperability
Year:
2006

Abstract

A parallel corpus is a valuable resource in many fields of multilingual natural language processing. One of the most significant obstacles to using parallel corpora is their limited availability. Researchers have therefore investigated approaches to collecting parallel texts from the Web. A basic component of these approaches is an algorithm that judges whether a pair of texts is parallel or not. In this paper, we propose an algorithm that accelerates this task without losing accuracy by preprocessing both a bilingual dictionary and the collection of texts. The method achieved a throughput of 250,000 pairs/sec on a single CPU, with a best F1 score of 0.960 on the task of detecting 200 Japanese-English translation pairs out of 40,000. The method is applicable to texts of any format and is not specific to HTML documents labeled with URLs. We report the details of these preprocessing methods and of the fast comparison algorithm. To the best of our knowledge, this is the first reported experiment on extracting Japanese-English parallel texts from a large corpus based solely on linguistic content.
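The abstract does not spell out the preprocessing or the comparison step, so the following is only a minimal sketch of the general idea of dictionary-based parallel text detection, not the authors' algorithm. It assumes that texts are already tokenized, that the bilingual dictionary is a list of (Japanese, English) word pairs, and that a Dice overlap with a fixed threshold is an acceptable similarity measure. All function names, the threshold, and the scoring rule are hypothetical choices made for illustration.

```python
def build_translation_table(entries):
    """Preprocess the dictionary once: Japanese word -> set of English translations."""
    table = {}
    for ja, en in entries:
        table.setdefault(ja, set()).add(en.lower())
    return table


def project_japanese(tokens, table):
    """Preprocess a Japanese text once: the English words its tokens may translate to."""
    projected = set()
    for tok in tokens:
        projected |= table.get(tok, set())
    return frozenset(projected)


def normalize_english(tokens, vocabulary):
    """Preprocess an English text once: keep only words known to the dictionary."""
    return frozenset(t.lower() for t in tokens if t.lower() in vocabulary)


def overlap_score(ja_set, en_set):
    """Dice coefficient between two preprocessed texts (an assumed measure)."""
    if not ja_set or not en_set:
        return 0.0
    return 2 * len(ja_set & en_set) / (len(ja_set) + len(en_set))


def detect_pairs(ja_texts, en_texts, entries, threshold=0.5):
    """Return index pairs (i, j) whose overlap reaches the (hypothetical) threshold."""
    table = build_translation_table(entries)
    vocabulary = set().union(*table.values())
    # Each text is preprocessed once, so the per-pair work is a set intersection.
    ja_sets = [project_japanese(t, table) for t in ja_texts]
    en_sets = [normalize_english(t, vocabulary) for t in en_texts]
    return [
        (i, j)
        for i, ja_set in enumerate(ja_sets)
        for j, en_set in enumerate(en_sets)
        if overlap_score(ja_set, en_set) >= threshold
    ]


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over sets of detected pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

The point of the sketch is the division of labour: the dictionary and each text are processed once, so comparing a pair costs only a set intersection. The F1 helper shows how a figure such as the reported 0.960 would be computed from detected and gold pairs.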