A large scale distributed syntactic, semantic and lexical language model for machine translation

  • Authors:
  • Ming Tan, Wenli Zhou, Lei Zheng, Shaojun Wang

  • Affiliations:
  • Wright State University, Dayton, OH (all authors)

  • Venue:
  • HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
  • Year:
  • 2011

Abstract

This paper presents an attempt at building a large-scale distributed composite language model that simultaneously accounts for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content under a directed Markov random field paradigm. The composite model is trained with a convergent N-best-list approximate EM algorithm that has linear time complexity, followed by an EM step that improves word-prediction power, on corpora of up to a billion tokens stored on a supercomputer. The resulting model yields a drastic perplexity reduction over n-gram models and achieves significantly better translation quality, as measured by BLEU score and "readability", when used to re-rank the N-best lists produced by a state-of-the-art parsing-based machine translation system.
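
To make the re-ranking application concrete, here is a minimal Python sketch of scoring an MT N-best list with a composite language model. The three component scorers, the `lm_weight` parameter, and the toy hypotheses are hypothetical stand-ins: the paper's actual model couples an n-gram model, a structured (syntactic) language model, and a PLSA-style semantic model under a directed Markov random field, which this additive log-score does not reproduce.

```python
# Sketch: re-rank an N-best list with a composite LM score.
# All component scorers below are illustrative placeholders,
# not the paper's trained models.
import math
from typing import List, Tuple

def ngram_log_prob(sentence: List[str]) -> float:
    # Stand-in for a word n-gram model: uniform over a toy 10k vocabulary.
    return -len(sentence) * math.log(10_000)

def syntactic_log_prob(sentence: List[str]) -> float:
    # Stand-in for a structured LM score over mid-range parse structure.
    return -0.5 * len(sentence)

def semantic_log_prob(sentence: List[str]) -> float:
    # Stand-in for a long-span, PLSA-style document semantic score.
    return -0.1 * len(sentence)

def rerank_nbest(
    nbest: List[Tuple[List[str], float]],  # (hypothesis, decoder score)
    lm_weight: float = 0.5,                # hypothetical interpolation weight
) -> List[Tuple[List[str], float]]:
    """Combine the decoder score with the composite LM log-probability
    and sort the hypotheses best-first."""
    def total(item: Tuple[List[str], float]) -> float:
        hyp, decoder_score = item
        lm = (ngram_log_prob(hyp)
              + syntactic_log_prob(hyp)
              + semantic_log_prob(hyp))
        return decoder_score + lm_weight * lm
    return sorted(nbest, key=total, reverse=True)

if __name__ == "__main__":
    nbest = [
        ("the cat sat on the mat".split(), -12.3),
        ("the cat sat the mat on".split(), -12.1),
    ]
    for hyp, decoder_score in rerank_nbest(nbest):
        print(" ".join(hyp), decoder_score)
```

In the paper's setting, the top-ranked hypothesis after this re-scoring replaces the decoder's original 1-best output; the claimed BLEU and "readability" gains come from the composite model's richer scoring, not from any change to the underlying translation system.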