We present a fast method for identifying homogeneous parallel documents. The method counts the identical low-frequency words shared between a document and each candidate translation, and selects the candidate with the most shared low-frequency words as the parallel document. Tested on the EUROPARL corpus of parliamentary proceedings, the method achieved 99.96% accuracy, failing only in anomalous cases of truncated or otherwise distorted documents. Other work has shown similar performance on this type of dataset, but the approach presented here is faster and requires no training. Besides proposing an efficient method for parallel document identification in a restricted domain, this paper provides evidence that parliamentary proceedings may be inappropriate for testing parallel document identification systems in general.
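The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization, the cutoff for "low-frequency" (here `max_count=1`, i.e. words occurring once in a document), the minimum word length `min_len`, and the function names `low_freq_words` and `best_match` are all assumptions made for illustration.

```python
import re
from collections import Counter


def low_freq_words(text, max_count=1, min_len=4):
    """Return the set of words occurring at most max_count times in text.

    Assumed details (not from the abstract): lowercasing, regex
    tokenization, and ignoring words shorter than min_len characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c <= max_count and len(w) >= min_len}


def best_match(source_text, candidates, max_count=1, min_len=4):
    """Pick the candidate sharing the most identical low-frequency words.

    candidates maps a hypothetical document id to its text.
    Returns (best_id, shared_word_count).
    """
    src = low_freq_words(source_text, max_count, min_len)
    best_id, best_score = None, -1
    for cand_id, cand_text in candidates.items():
        shared = src & low_freq_words(cand_text, max_count, min_len)
        if len(shared) > best_score:
            best_id, best_score = cand_id, len(shared)
    return best_id, best_score


# Toy usage: names, numbers, and cognates tend to survive translation.
source = "The delegate from Strasbourg cited directive 2004 and Barroso on Monday"
candidates = {
    "fr_a": "Le délégué de Strasbourg a cité la directive 2004 et Barroso lundi",
    "fr_b": "Le président a ouvert la séance à Bruxelles",
}
match_id, shared_count = best_match(source, candidates)
```

Because the sketch only intersects word sets, it needs no training data or bilingual lexicon, which is consistent with the training-free property claimed in the abstract. It relies on identical strings such as proper names and numbers appearing in both languages.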