Fast, easy, and cheap: construction of statistical machine translation models with MapReduce

Authors:
Christopher Dyer;Aaron Cordova;Alex Mont;Jimmy Lin
Affiliations:
University of Maryland, College Park, MD;University of Maryland, College Park, MD;University of Maryland, College Park, MD;University of Maryland, College Park, MD
Venue:
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Year:
2008

Citing 14
Cited 13

Web Search for a Planet: The Google Cluster Architecture

IEEE Micro
A systematic comparison of various statistical alignment models

Computational Linguistics
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
The mathematics of statistical machine translation: parameter estimation

Computational Linguistics - Special issue on using large corpora: II
A comparison of alignment models for statistical machine translation

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
HMM-based word alignment in statistical translation

COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Discriminative training and maximum entropy models for statistical machine translation

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Statistical phrase-based translation

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Scaling phrase-based statistical machine translation to larger corpora and longer phrases

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
MTTK: an alignment toolkit for statistical machine translation

NAACL-Demonstrations '06 Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume: demonstrations
Google news personalization: scalable online collaborative filtering

Proceedings of the 16th international conference on World Wide Web
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Machine translation by pattern matching

Machine translation by pattern matching
Exploring large-data issues in the curriculum: a case study with MapReduce

TeachCL '08 Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

Tera-scale translation models via pattern matching

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Evaluating SPLASH-2 Applications Using MapReduce

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Exploring large-data issues in the curriculum: a case study with MapReduce

TeachCL '08 Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics
Design patterns for efficient graph algorithms in MapReduce

Proceedings of the Eighth Workshop on Mining and Learning with Graphs
Distributed asynchronous online learning for natural language processing

CoNLL '10 Proceedings of the Fourteenth Conference on Computational Natural Language Learning
Rapid parallel genome indexing with MapReduce

Proceedings of the second international workshop on MapReduce and its applications
A distributed look-up architecture for text mining applications using MapReduce

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce

Proceedings of the 21st international conference on World Wide Web
Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Why not grab a free lunch?: mining large corpora for parallel sentences to improve translation modeling

NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Breaking the MapReduce stage barrier

Cluster Computing
Scaling big data mining infrastructure: the twitter experience

ACM SIGKDD Explorations Newsletter

Quantified Score

Hi-index	0.00

Visualization

Abstract

In recent years, the quantity of parallel training data available for statistical machine translation has increased far more rapidly than the performance of individual computers, resulting in a potentially serious impediment to progress. Parallelization of the model-building algorithms that process this data on computer clusters is fraught with challenges such as synchronization, data exchange, and fault tolerance. However, the MapReduce programming paradigm has recently emerged as one solution to these issues: a powerful functional abstraction hides system-level details from the researcher, allowing programs to be transparently distributed across potentially very large clusters of commodity hardware. We describe MapReduce implementations of two algorithms used to estimate the parameters for two word alignment models and one phrase-based translation model, all of which rely on maximum likelihood probability estimates. On a 20-machine cluster, experimental results show that our solutions exhibit good scaling characteristics compared to a hypothetical, optimally-parallelized version of current state-of-the-art single-core tools.