Identifying parallel documents from a large bilingual collection of texts: application to parallel article extraction in Wikipedia

  • Authors:
  • Alexandre Patry;Philippe Langlais

  • Affiliations:
  • KeaText, Boulevard Décarie, bureau, Saint-Laurent, Canada;DIRO/RALI, Université de Montréal, Montréal, Canada

  • Venue:
  • BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
  • Year:
  • 2011

Abstract

While several recent works on large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present Paradocs, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of controlled tasks. We applied it to the French-English cross-language linked article pairs of Wikipedia in order to see whether parallel articles are available in this resource, and whether our system is able to locate them. According to a manual evaluation we conducted, a fourth of the article pairs in Wikipedia are indeed in a translation relation, and Paradocs identifies parallel or noisy parallel article pairs with a precision of 80%.