Linguistic knowledge in statistical phrase-based word alignment

Authors:
A. De Gispert;J. B. Mariño
Affiliations:
TALP Research Center, Universitat Politècnica de Catalunya (UPC), Jordi Girona 1-3, Campus Nord D5, 08034 Barcelona, Spain e-mail: agispert@gps.tsc.upc.es, canton@gps.tsc.upc.es
Venue:
Natural Language Engineering
Year:
2006

Abstract

This paper presents a novel phrase alignment strategy that combines linguistic knowledge with cooccurrence measures extracted from bilingual corpora. The algorithm has four stages: phrase selection and classification, phrase alignment, one-to-one word alignment, and postprocessing. The first stage selects a linguistically derived set of phrases that convey a unified meaning during translation and are therefore aligned together in parallel texts. These phrases include verb phrases, idiomatic expressions and date expressions. The second stage produces very high-precision links between the selected phrases in the two languages. The third stage performs a statistical word alignment of the remaining unaligned tokens using association measures and link probabilities, and the fourth stage makes final decisions on any tokens still unaligned, based on linguistic knowledge. Experiments are reported for an English-Spanish parallel corpus, with a detailed description of the evaluation measure and manual reference used. Results show that phrase cooccurrence measures convey information complementary to word cooccurrences and provide stronger evidence of a correct alignment, successfully introducing linguistic knowledge into a statistical word alignment scheme. The reported Precision, Recall and Alignment Error Rate (AER) results outperform those of state-of-the-art alignment algorithms.
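
To make the evaluation measures named in the abstract concrete, here is a minimal sketch of Precision, Recall and AER as they are commonly defined for word alignment against a manual reference with "sure" and "possible" links. The abstract does not give the formulas, so the exact variant used in the paper may differ. The function name `alignment_scores` and the representation of links as `(source_index, target_index)` pairs are assumptions made for illustration.

```python
def alignment_scores(hypothesis, sure, possible):
    """Return (precision, recall, aer) for an alignment hypothesis.

    Assumed convention (not taken from the paper): every argument is an
    iterable of (source_index, target_index) links, and every sure link
    also counts as a possible link.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # sure links are a subset of possible links

    if not a or not s:
        raise ValueError("hypothesis and sure link sets must be non-empty")

    a_and_s = len(a & s)  # hypothesis links confirmed as sure
    a_and_p = len(a & p)  # hypothesis links that are at least possible

    precision = a_and_p / len(a)
    recall = a_and_s / len(s)
    aer = 1.0 - (a_and_s + a_and_p) / (len(a) + len(s))
    return precision, recall, aer


# Toy example with made-up links, for illustration only.
hyp = {(0, 0), (1, 1), (2, 3)}
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 2)}
precision, recall, aer = alignment_scores(hyp, sure_links, possible_links)
```

In this toy case two of the three hypothesis links are in the reference, so precision is 2/3, recall is 1, and AER is 1 - 4/5 = 0.2. Lower AER is better, and a hypothesis is not penalized in recall for missing links that are only marked possible.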