Automatic identification of parallel documents with light or without linguistic resources

  • Authors:
  • Alexandre Patry; Philippe Langlais

  • Affiliations:
  • Laboratoire de Recherche Appliquée en Linguistique Informatique, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Qu...; Laboratoire de Recherche Appliquée en Linguistique Informatique, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Qu...

  • Venue:
  • AI'05 Proceedings of the 18th Canadian Society conference on Advances in Artificial Intelligence
  • Year:
  • 2005

Abstract

Parallel corpora play a crucial role in multilingual natural language processing. Unfortunately, the availability of such resources is the bottleneck in most applications of interest. Mining the web for parallel corpora is a viable solution that comes at a price: it is not always easy to identify parallel documents among the crawled material. In this study, we address the problem of automatically identifying the pairs of texts that are translations of each other in a set of documents. We show that it is possible to automatically build particularly efficient content-based methods that make use of very little lexical knowledge. We also evaluate our approach on a front-end translation task and demonstrate that our parallel text classifier yields better performance than another approach based on a rich lexicon.