Phrase-based statistical language modeling from bilingual parallel corpus

  • Authors:
  • Jun Mao; Gang Cheng; Yanxiang He

  • Affiliations:
  • Computer School, Wuhan University, Wuhan, China; Computer School, Wuhan University, Wuhan, China; Computer School, Wuhan University, Wuhan, China

  • Venue:
  • ESCAPE'07 Proceedings of the First International Conference on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies
  • Year:
  • 2007

Abstract

Phrase-based models and class-based models are both variants of classical n-gram models. In this paper, we propose an approach that merges the two. In the phrase-based part, we extract phrases from a bilingual parallel corpus using a method derived from phrase-based translation models. We then partition these phrases into phrase classes by minimizing the loss of average mutual information, with the aid of a count matrix. Our experimental results suggest that phrase-based models capture more key information than word-based models, and that class-based models capture the relationships among similar words or phrases and thus partly alleviate the problem of data sparseness.
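
The class-partitioning step can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so this assumes a greedy, Brown-style merge: the count matrix holds how often class i is immediately followed by class j, and each step merges the pair of classes whose merge loses the least average mutual information. The function names and the toy matrix are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def average_mutual_information(counts):
    """Average mutual information (in bits) between adjacent classes.

    counts[i][j] is the number of times class i is followed by class j.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        return 0.0
    n = len(counts)
    row_sums = [sum(counts[i]) for i in range(n)]
    col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    ami = 0.0
    for i in range(n):
        for j in range(n):
            c = counts[i][j]
            if c > 0:  # zero cells contribute nothing
                ami += (c / total) * math.log2(c * total / (row_sums[i] * col_sums[j]))
    return ami


def merge_classes(counts, a, b):
    """Return a new count matrix with class b folded into class a."""
    n = len(counts)
    merged = [row[:] for row in counts]
    for j in range(n):  # add row b to row a
        merged[a][j] += merged[b][j]
    for i in range(n):  # add column b to column a
        merged[i][a] += merged[i][b]
    keep = [k for k in range(n) if k != b]  # drop row and column b
    return [[merged[i][j] for j in keep] for i in keep]


def best_merge(counts):
    """Return (loss, a, b) for the pair whose merge loses the least AMI."""
    base = average_mutual_information(counts)
    best = None
    for a, b in combinations(range(len(counts)), 2):
        loss = base - average_mutual_information(merge_classes(counts, a, b))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best


# Hypothetical toy matrix: classes 0 and 1 occur in identical contexts.
toy_counts = [
    [0, 0, 4],
    [0, 0, 4],
    [2, 2, 0],
]
print(best_merge(toy_counts))
```

In the toy matrix, classes 0 and 1 have identical left and right contexts, so merging them costs essentially no mutual information and they are selected first. Recomputing the full matrix for every candidate pair is quadratic per step and is done here only for readability; a practical system would update the loss incrementally.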