Stream-based translation models for statistical machine translation

Authors:
Abby Levenberg;Chris Callison-Burch;Miles Osborne
Affiliations:
University of Edinburgh;Johns Hopkins University;University of Edinburgh
Venue:
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2010

Abstract

Typical statistical machine translation systems are trained on static parallel corpora. Here we account for scenarios with a continuous incoming stream of parallel training data. Such scenarios include daily governmental proceedings, sustained output from translation agencies, or crowd-sourced translations. We show that incorporating recent sentence pairs from the stream improves performance compared with a static baseline. Since frequent batch retraining is computationally demanding, we introduce a fast incremental alternative using an online version of the EM algorithm. To bound our memory requirements, we use a novel data structure and an associated training regime. When compared to frequent batch retraining, our online, time- and space-bounded model achieves the same performance with significantly less computational overhead.
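
The idea of updating a translation model incrementally with online EM can be illustrated with a small sketch. The abstract does not say which online EM variant or which alignment model is used, so everything below is an assumption made for illustration: the sketch uses stepwise EM (interpolating sufficient statistics with a decaying step size) on IBM Model 1 lexical translation probabilities. The names `stepwise_em`, `expected_counts`, `normalise`, the `<NULL>` token, and the parameters `alpha` and `uniform` are hypothetical, and the memory-bounding data structure mentioned in the abstract is not modelled.

```python
from collections import defaultdict


def expected_counts(batch, t, uniform):
    """E-step: expected alignment counts for one mini-batch under IBM Model 1.

    batch is a list of (source_tokens, target_tokens) pairs; t maps
    (target_word, source_word) to a probability. Pairs missing from t fall
    back to a small constant `uniform` (an assumption of this sketch).
    """
    counts = defaultdict(float)
    for src, tgt in batch:
        src = ["<NULL>"] + list(src)  # allow target words to align to nothing
        for f in tgt:
            probs = [t.get((f, e), uniform) for e in src]
            z = sum(probs)
            for e, p in zip(src, probs):
                counts[(f, e)] += p / z  # posterior that f aligns to e
    return counts


def normalise(mu):
    """M-step: turn sufficient statistics into probabilities t(f | e)."""
    totals = defaultdict(float)
    for (f, e), c in mu.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in mu.items() if c > 0}


def stepwise_em(stream, alpha=0.7, uniform=1e-3):
    """Consume mini-batches from `stream`, updating the model after each one."""
    mu = defaultdict(float)  # running sufficient statistics
    t = {}                   # current translation table
    for k, batch in enumerate(stream):
        s = expected_counts(batch, t, uniform)
        eta = (k + 2) ** -alpha          # decaying step size
        for key in mu:                   # decay the old statistics
            mu[key] *= (1.0 - eta)
        for key, c in s.items():         # blend in the new statistics
            mu[key] += eta * c
        t = normalise(mu)
    return t


if __name__ == "__main__":
    corpus = [
        ("the house".split(), "la maison".split()),
        ("the book".split(), "le livre".split()),
        ("a house".split(), "une maison".split()),
    ]
    table = stepwise_em([corpus] * 5)
    print(table[("maison", "house")], table[("la", "house")])
```

Each mini-batch is processed once and then discarded, which is what makes this kind of update cheaper than re-running batch EM over all data seen so far. The step size `alpha` controls how quickly older statistics are forgotten in favour of recent sentence pairs.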