Automatic identification of parallel documents with light or without linguistic resources

Authors:
Alexandre Patry;Philippe Langlais
Affiliations:
Laboratoire de Recherche Appliquée en Linguistique Informatique, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Qu...;Laboratoire de Recherche Appliquée en Linguistique Informatique, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Qu...
Venue:
AI'05 Proceedings of the 18th Canadian Society conference on Advances in Artificial Intelligence
Year:
2005

Abstract

Parallel corpora play a crucial role in multilingual natural language processing. Unfortunately, the availability of such resources is the bottleneck in most applications of interest. Mining the web for parallel corpora is a viable solution that comes at a price: it is not always easy to identify parallel documents among the crawled material. In this study, we address the problem of automatically identifying the pairs of texts that are translations of each other in a set of documents. We show that it is possible to automatically build particularly efficient content-based methods that make use of very little lexical knowledge. We also evaluate our approach on a front-end translation task and demonstrate that our parallel text classifier yields better performance than another approach based on a rich lexicon.
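
To make the idea of a content-based method with very little lexical knowledge more concrete, the following is a minimal sketch, not the authors' classifier. It assumes two language-independent cues chosen purely for illustration (overlap of numeric tokens and similarity of text lengths), and the names `number_tokens`, `parallel_score`, and `best_matches` are hypothetical.

```python
import re

# Numeric tokens such as "2005", "12.5" or "12,5"; no lexicon is needed to find them.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def number_tokens(text):
    """Return the set of numeric tokens found in a text."""
    return set(NUMBER.findall(text))


def parallel_score(source, target):
    """Score a candidate pair with two illustrative, lexicon-free cues.

    Both cues are assumptions made for this sketch, not features
    taken from the paper.
    """
    src_nums, tgt_nums = number_tokens(source), number_tokens(target)
    union = src_nums | tgt_nums
    # Jaccard overlap of numeric tokens; 0.0 when neither text has numbers.
    overlap = len(src_nums & tgt_nums) / len(union) if union else 0.0
    # Ratio of character lengths in [0, 1]; closer to 1 means similar lengths.
    longest = max(len(source), len(target))
    length_ratio = min(len(source), len(target)) / longest if longest else 0.0
    return overlap * length_ratio


def best_matches(sources, targets):
    """Map each source document id to its highest-scoring target id."""
    return {
        s_id: max(targets, key=lambda t_id: parallel_score(s_text, targets[t_id]))
        for s_id, s_text in sources.items()
    }
```

Calling `best_matches` with two dictionaries that map document identifiers to raw text returns, for each source document, the identifier of the most similar target document. A trained classifier over richer features, as the abstract suggests, would replace the hand-written product used here as the score.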