Automatic filtering of bilingual corpora for statistical machine translation

Authors:
Shahram Khadivi;Hermann Ney
Affiliations:
Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany;Lehrstuhl für Informatik VI – Computer Science Department, RWTH Aachen University, Aachen, Germany
Venue:
NLDB'05 Proceedings of the 10th international conference on Natural Language Processing and Information Systems
Year:
2005

Abstract

For many applications, such as machine translation and bilingual information retrieval, bilingual corpora play an important role in training the system. Because these corpora are obtained through automatic or semi-automatic methods, they usually contain noise: sentence pairs that are worthless or even harmful for training. We study the effect of different levels of corpus noise on an end-to-end statistical machine translation system. We also propose an efficient corpus filtering method that removes the noisy part of a corpus using state-of-the-art word alignment models. We demonstrate the efficiency of this method in terms of the sentence misalignment rate of the filtered corpus and its positive effect on translation quality.
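
The abstract does not spell out the scoring function, so the following is only a minimal sketch of how alignment-model-based filtering could look, not the authors' method. It assumes an IBM Model 1 style lexical score, averaged over both translation directions and normalized by sentence length, with a fixed threshold. The names `model1_log_score`, `filter_corpus`, the lexicon tables `t_fwd` and `t_bwd`, the `floor` probability, and the threshold value are all hypothetical choices made for illustration.

```python
import math


def model1_log_score(src, tgt, t_table, floor=1e-6):
    """Length-normalized IBM Model 1 style log-likelihood of tgt given src.

    t_table maps (tgt_word, src_word) -> p(tgt_word | src_word).
    Unseen word pairs fall back to `floor` (an assumed smoothing constant).
    """
    src_with_null = ["<NULL>"] + list(src)  # empty word, as in Model 1
    total = 0.0
    for f in tgt:
        # Average the lexical probability over all source positions.
        p = sum(t_table.get((f, e), floor) for e in src_with_null)
        total += math.log(p / len(src_with_null))
    return total / max(len(tgt), 1)


def filter_corpus(pairs, t_fwd, t_bwd, threshold):
    """Keep sentence pairs whose bidirectional score reaches the threshold.

    t_fwd scores target words given source words; t_bwd scores the reverse.
    The symmetric average and the threshold are assumptions of this sketch.
    """
    kept = []
    for src, tgt in pairs:
        score = 0.5 * (
            model1_log_score(src, tgt, t_fwd)
            + model1_log_score(tgt, src, t_bwd)
        )
        if score >= threshold:
            kept.append((src, tgt))
    return kept


# Toy lexicon tables (hypothetical values, for illustration only).
t_fwd = {("das", "the"): 0.9, ("haus", "house"): 0.9}
t_bwd = {("the", "das"): 0.9, ("house", "haus"): 0.9}

pairs = [
    (["the", "house"], ["das", "haus"]),      # plausible translation
    (["the", "house"], ["guten", "morgen"]),  # misaligned pair
]

kept = filter_corpus(pairs, t_fwd, t_bwd, threshold=-3.0)
```

In practice the lexicon tables would come from a word alignment model trained on the corpus itself, and the threshold would be tuned on held-out data, for example against the sentence misalignment rate mentioned in the abstract.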