Data cleaning for word alignment

Authors:
Tsuyoshi Okita
Affiliations:
Dublin City University, Glasnevin, Dublin
Venue:
ACLstudent '09 Proceedings of the ACL-IJCNLP 2009 Student Research Workshop
Year:
2009

Citing 20
Cited 1

Induction of semantic classes from natural language text

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
A systematic comparison of various statistical alignment models

Computational Linguistics
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
HMM-based word alignment in statistical translation

COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Automatic construction of machine translation knowledge using translation literalness

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
BLEU: a method for automatic evaluation of machine translation

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Statistical phrase-based translation

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Precision and recall of machine translation

NAACL-Short '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003--short papers - Volume 2
A phrase-based, joint probability model for statistical machine translation

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Paraphrasing with bilingual parallel corpora

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
A discriminative framework for bilingual word alignment

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
HMM word and phrase alignment for statistical machine translation

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Improved statistical machine translation using paraphrases

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Alignment by agreement

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

HLT '02 Proceedings of the second international conference on Human Language Technology Research
Evaluating machine translation with LFG dependencies

Machine Translation
Moses: open source toolkit for statistical machine translation

ACL '07 Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
ParaMetric: an automatic evaluation metric for paraphrasing

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Kernel regression based machine translation

NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Parallel implementations of word alignment tool

SETQA-NLP '08 Software Engineering, Testing, and Quality Assurance for Natural Language Processing

Consistency checking for Treebank alignment

LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop

Quantified Score

Hi-index	0.01

Visualization

Abstract

Parallel corpora are made by human beings. However, as an MT system is an aggregation of state-of-the-art NLP technologies without any intervention of human beings, it is unavoidable that quite a few sentence pairs are beyond its analysis and that will therefore not contribute to the system. Furthermore, they in turn may act against our objectives to make the overall performance worse. Possible unfavorable items are n: m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner under the assumption that their frequency is low, such as below 5 percent. We show an improvement of Bleu score from 28.0 to 31.4 in English-Spanish and from 16.9 to 22.1 in German-English.