Identifying parallel documents from a large bilingual collection of texts: application to parallel article extraction in Wikipedia

Authors:
Alexandre Patry;Philippe Langlais
Affiliations:
KeaText, Boulevard Décarie, bureau, Saint-Laurent, Canada;DIRO/RALI, Université de Montréal, Montréal, Canada
Venue:
BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Year:
2011

Citing 11
Cited 1

Adaptive Parallel Sentences Mining from Web Bilingual News Collection

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
The Web as a parallel corpus

Computational Linguistics - Special issue on web as corpus
A DOM tree alignment model for mining parallel data from the web

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Information arbitrage across multi-lingual Wikipedia

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Directions for exploiting asymmetries in multilingual Wikipedia

CLIAWS3 '09 Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies
On the use of comparable corpora to improve SMT performance

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
A fast method for parallel document identification

NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Findings of the 2009 workshop on statistical machine translation

StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
The WEKA data mining software: an update

ACM SIGKDD Explorations Newsletter
Extracting parallel sentences from comparable corpora using document level alignment

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Automatic identification of parallel documents with light or without linguistic resources

AI'05 Proceedings of the 18th Canadian Society conference on Advances in Artificial Intelligence

A minimally supervised approach for detecting and ranking document translation pairs

WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation

Abstract

While several recent works dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present Paradocs, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of controlled tasks. We applied it to the French-English cross-language linked article pairs of Wikipedia in order to see whether parallel articles are available in this resource, and whether our system is able to locate them. According to a manual evaluation we conducted, a fourth of the article pairs in Wikipedia are indeed in a translation relation, and Paradocs identifies parallel or noisy parallel article pairs with a precision of 80%.