Improving statistical word alignments with morpho-syntactic transformations

Authors:
Adrià de Gispert;Deepa Gupta;Maja Popović;Patrik Lambert;Jose B. Mariño;Marcello Federico;Hermann Ney;Rafael Banchs
Affiliations:
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy;Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany;TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain;TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy;Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany;TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
Venue:
FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
Year:
2006

Abstract

This paper presents a wide range of statistical word alignment experiments that incorporate morphosyntactic information. By transforming the parallel corpus according to POS tags, lemmas, or stems, we explore which kinds of linguistic information help reduce the alignment error rate. Alignments are evaluated against a human word alignment reference, with the aim of an improved machine translation training scheme that eventually leads to better SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, the improvements from introducing morphosyntactic information are larger when data are scarce, but a significant improvement is also achieved in the large data task, which indicates that certain linguistic knowledge remains relevant even when large amounts of data are available.
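
The abstract does not spell out how the alignment error rate is computed. As a minimal sketch, assuming the commonly used definition with sure (S) and possible (P) reference links, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), the evaluation against a human reference could look like the following. The function name, the link representation as (source index, target index) pairs, and the toy data are illustrative assumptions, not taken from the paper.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Alignment error rate under the assumed sure/possible definition.

    Each argument is an iterable of (source_index, target_index) links.
    AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where S is a subset of P.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # every sure link also counts as a possible link
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed: no error
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Hypothetical toy example: two sure links and one possible link in the reference.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
system_links = {(0, 0), (1, 1), (2, 2)}  # last link is in neither S nor P

aer = alignment_error_rate(system_links, sure_links, possible_links)
print(f"AER = {aer:.3f}")
```

Under this definition, a system link that matches only a possible reference link is not penalized, while a missed sure link or a link outside the reference raises the error rate.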