Phrase-based statistical language modeling from bilingual parallel corpus

Authors:
Jun Mao; Gang Cheng; Yanxiang He
Affiliations:
Computer School, Wuhan University, Wuhan, China (all authors)
Venue:
ESCAPE'07 Proceedings of the First international conference on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies
Year:
2007

Abstract

Phrase-based models and class-based models are both variants of the classical n-gram model. In this paper, we propose an approach that merges the two. For the phrase-based part, we extract phrases from a bilingual parallel corpus using a method derived from phrase-based translation models. We then partition these phrases into phrase classes by minimizing the loss of average mutual information, with the aid of a count matrix. Our experimental results suggest that phrase-based models capture more key information than word-based models, and that class-based models capture the relationships among similar words or phrases and thereby alleviate the data sparseness problem to some extent.
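
The class-partitioning step described in the abstract can be illustrated with a minimal sketch. It assumes a square count matrix in which entry (i, j) is the number of times phrase class i is immediately followed by phrase class j, and it tries every pair of classes to find the merge that loses the least average mutual information. The function names, the matrix layout, and the exhaustive pairwise search are assumptions made for illustration; the abstract does not specify the authors' actual procedure.

```python
import math
from itertools import combinations


def average_mutual_information(counts):
    """Average mutual information (in bits) between adjacent classes.

    counts[i][j] is assumed to be how often class i is followed by class j.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        return 0.0
    n = len(counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    ami = 0.0
    for i in range(n):
        for j in range(n):
            c = counts[i][j]
            if c > 0:
                # p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
                ami += (c / total) * math.log2(c * total / (row_sums[i] * col_sums[j]))
    return ami


def merge_classes(counts, a, b):
    """Return a new count matrix with class b folded into class a (a < b)."""
    n = len(counts)
    merged = [row[:] for row in counts]
    for j in range(n):          # add row b into row a
        merged[a][j] += merged[b][j]
    for i in range(n):          # add column b into column a
        merged[i][a] += merged[i][b]
    keep = [k for k in range(n) if k != b]
    return [[merged[i][j] for j in keep] for i in keep]


def best_merge(counts):
    """Return (loss, a, b) for the pair whose merge loses the least information."""
    base = average_mutual_information(counts)
    best = None
    for a, b in combinations(range(len(counts)), 2):
        loss = base - average_mutual_information(merge_classes(counts, a, b))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best


# Hypothetical toy data: classes 0 and 1 occur in identical contexts,
# so merging them should cost essentially no mutual information.
toy_counts = [
    [0, 0, 4],
    [0, 0, 4],
    [2, 2, 0],
]
loss, a, b = best_merge(toy_counts)
print(f"merge classes {a} and {b}, loss = {loss:.6f} bits")
```

Repeating `best_merge` followed by `merge_classes` until a target number of classes remains gives a simple greedy partition. The exhaustive search recomputes the full sum for every candidate pair, so it is only practical for small matrices; a real system would update the count matrix and the information loss incrementally.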