The scaling problem in the pattern recognition approach to machine translation

Authors:
D. Ortiz-Martínez;I. García-Varea;F. Casacuberta
Affiliations:
Departament de Sistemes Informátics i Computació, Universitat Politècnica de València, Spain;Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Spain;Departament de Sistemes Informátics i Computació, Universitat Politècnica de València, Spain
Venue:
Pattern Recognition Letters
Year:
2008

Abstract

Statistical machine translation (SMT) has proven to be an interesting pattern recognition framework for automatically building machine translation systems from available parallel corpora. In the last few years, research in SMT has been characterized by two significant advances. The first is the popularization of the so-called phrase-based statistical translation models, which allow local contextual information to be incorporated into the translation models. The second is the availability of larger and larger parallel corpora, composed of millions of sentence pairs and tens of millions of running words. Since phrase-based models basically consist of statistical dictionaries of phrase pairs, their estimation from very large corpora is a very costly task that yields a huge number of parameters, all of which must be stored in memory. The handling of millions of model parameters and a similar number of training samples has become a bottleneck in the field of SMT, as well as in other well-known pattern recognition tasks such as speech recognition or handwriting recognition, to name just a few. In this paper, we propose a general framework that deals with the scaling problem in SMT without introducing significant time overhead, by combining different scaling techniques. This new framework is based on the use of counts instead of probabilities, and on the concept of cache memory.
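
The two ideas named at the end of the abstract, storing counts rather than probabilities and keeping a cache in front of the model, can be illustrated with a small sketch. This is not the authors' implementation. The class `CountPhraseTable`, its methods, the least-recently-used eviction policy, and the in-memory dictionary standing in for a large, slow phrase-pair store are all hypothetical choices made for illustration.

```python
from collections import Counter, OrderedDict, defaultdict


class CountPhraseTable:
    """Toy phrase table (hypothetical): keeps counts, derives probabilities on demand."""

    def __init__(self, cache_size=2):
        # Stand-in for the large, slow store: source phrase -> Counter(target -> count).
        self.counts = defaultdict(Counter)
        # Small, fast cache: source phrase -> {target: probability}, in LRU order.
        self.cache = OrderedDict()
        self.cache_size = cache_size

    def add(self, source, target, count=1):
        # Updating the model only increments a count; nothing is renormalized here.
        self.counts[source][target] += count
        # Any cached probabilities for this source phrase are now stale.
        self.cache.pop(source, None)

    def translations(self, source):
        # Cache hit: mark the entry as most recently used and return it.
        if source in self.cache:
            self.cache.move_to_end(source)
            return self.cache[source]
        # Cache miss: compute relative frequencies count(s, t) / count(s).
        targets = self.counts.get(source, {})
        total = sum(targets.values())
        probs = {t: c / total for t, c in targets.items()}
        self.cache[source] = probs
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return probs


table = CountPhraseTable(cache_size=2)
table.add("la casa", "the house", 3)
table.add("la casa", "the home", 1)
print(table.translations("la casa"))
```

In this sketch, adding new training evidence touches only the counts for one source phrase, and probabilities are computed only for the phrases that are actually queried, which is the kind of saving the count-based and cache-based design is meant to suggest.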