Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system

Authors:
Srinivas Bangalore; Vanessa Murdock; Giuseppe Riccardi
Affiliations:
AT&T Labs-Research, Florham Park, NJ; University of Massachusetts, Amherst, MA; AT&T Labs-Research, Florham Park, NJ
Venue:
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Year:
2002

Abstract

One of the primary issues in training statistical translation models is the paucity of bilingual data. In this paper, we propose techniques to alleviate the bilingual data bottleneck by creating a consensus from translations of monolingual data provided by several off-the-shelf translation engines. We compute the consensus alignment using a multi-sequence alignment algorithm used for DNA sequence alignment. We present an application of this technique to bootstrap bilingual data for the general domain of instant messaging. We train hierarchical statistical translation models on the bootstrapped bilingual data and show that the resulting statistical translation model outperforms each individual off-the-shelf translation system.
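
The consensus step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: instead of a true multiple-sequence alignment, it aligns each engine's output pairwise against a pivot translation using Python's standard `difflib.SequenceMatcher` and then takes a majority vote at each pivot position. The function name `consensus`, the choice of the first translation as pivot, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def consensus(translations):
    """Majority-vote consensus over token sequences aligned to a pivot.

    Simplified stand-in for multiple-sequence alignment: every hypothesis
    is aligned only to the first one (the pivot), never to the others.
    """
    hyps = [t.split() for t in translations]
    pivot = hyps[0]
    # One vote counter per pivot position; the pivot votes for its own token.
    votes = [Counter({tok: 1}) for tok in pivot]
    for hyp in hyps[1:]:
        matcher = SequenceMatcher(a=pivot, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "insert":
                continue  # tokens with no pivot slot are ignored in this sketch
            for k in range(i1, i2):
                j = j1 + (k - i1)
                # Pair tokens position by position; a pivot slot with no
                # partner receives an empty vote, meaning "delete this token".
                token = hyp[j] if tag != "delete" and j < j2 else ""
                votes[k][token] += 1
    # Ties go to the token seen first, which is the pivot's own token.
    best = [counter.most_common(1)[0][0] for counter in votes]
    return " ".join(tok for tok in best if tok)


# Hypothetical outputs of three translation engines for one source sentence.
engine_outputs = [
    "where is you going",
    "where are you going",
    "where are you going",
]
print(consensus(engine_outputs))
```

In this toy case the pivot's "is" is outvoted by "are" from the other two engines. A fuller implementation would also handle tokens inserted relative to the pivot and would choose the pivot more carefully, which is where a genuine multiple-sequence alignment becomes useful.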