Adaptive Parallel Sentences Mining from Web Bilingual News Collection
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Computational Linguistics - Special issue on web as corpus
A DOM tree alignment model for mining parallel data from the web
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Information arbitrage across multi-lingual Wikipedia
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Directions for exploiting asymmetries in multilingual Wikipedia
CLIAWS3 '09 Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies
On the use of comparable corpora to improve SMT performance
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
A fast method for parallel document identification
NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Findings of the 2009 workshop on statistical machine translation
StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter
Extracting parallel sentences from comparable corpora using document level alignment
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Automatic identification of parallel documents with light or without linguistic resources
AI'05 Proceedings of the 18th Canadian Society conference on Advances in Artificial Intelligence
A minimally supervised approach for detecting and ranking document translation pairs
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Hi-index | 0.00 |
While several recent works on dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek for extracting parallel sentences from comparable corpora, we present Paradocs, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of controlled tasks. We applied it on the French-English cross-language linked article pairs of Wikipedia in order see whether parallel articles in this resource are available, and if our system is able to locate them. According to some manual evaluation we conducted, a fourth of the article pairs in Wikipedia are indeed in translation relation, and Paradocs identifies parallel or noisy parallel article pairs with a precision of 80%.