On the use of comparable corpora to improve SMT performance

Authors:
Sadaf Abdul-Rauf;Holger Schwenk
Affiliations:
University of Le Mans, France;University of Le Mans, France
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Abstract

We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel text to translate the source side of the non-parallel corpus. The target-side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create French/English parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system.
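
The pipeline described in the abstract (translate the source side, retrieve candidate target sentences, filter) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the retrieval method or the filters, so the IDF-weighted word overlap used for retrieval, the word error rate (WER) filter, the `max_wer` threshold, and all function names are assumptions made for illustration. The SMT step itself is not shown; the sketch assumes its output is already available as a list of translations.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def build_idf(sentences):
    """Smoothed inverse document frequency over the target-side sentences."""
    n = len(sentences)
    df = Counter()
    for sent in sentences:
        df.update(set(tokenize(sent)))
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}


def retrieve(query, candidates, idf):
    """Return the candidate with the highest IDF-weighted word overlap.

    Stand-in for a real information retrieval engine (assumption).
    """
    q = set(tokenize(query))

    def score(cand):
        return sum(idf.get(w, 0.0) for w in q & set(tokenize(cand)))

    return max(candidates, key=score)


def word_error_rate(hyp, ref):
    """Word-level edit distance divided by the reference length."""
    h, r = tokenize(hyp), tokenize(ref)
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                # drop a hypothesis word
                           cur[j - 1] + 1,             # add a reference word
                           prev[j - 1] + (hw != rw)))  # match or substitute
        prev = cur
    return prev[-1] / max(len(r), 1)


def extract_pairs(source_sents, translations, target_sents, max_wer=0.5):
    """Pair each source sentence with its best-matching target sentence.

    translations[i] is assumed to be the SMT output for source_sents[i].
    A pair is kept only if the translation is close enough to the
    retrieved sentence; max_wer is a hypothetical threshold.
    """
    idf = build_idf(target_sents)
    pairs = []
    for src, hyp in zip(source_sents, translations):
        best = retrieve(hyp, target_sents, idf)
        if word_error_rate(hyp, best) <= max_wer:
            pairs.append((src, best))
    return pairs
```

Calling `extract_pairs` with French source sentences, their automatic English translations, and the English side of the comparable corpus returns the (French, English) pairs that pass the filter. A real system would restrict the candidates, for example to articles from nearby dates, rather than scanning the whole target side for every query.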