Modeling of long distance context dependency in Chinese

Authors:
GuoDong Zhou
Affiliations:
Institute for Infocomm Research, Singapore
Venue:
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Year:
2003

Abstract

N-gram modeling is a simple and widely used approach to language modeling. However, it can only capture short-distance context dependency within an N-word window, where the largest practical N for natural language is three, while much of the context dependency in natural language occurs beyond a three-word window. To incorporate this long-distance context dependency, this paper proposes a new MI-Ngram modeling approach. The MI-Ngram model consists of two components: an n-gram model and an MI model. The n-gram model captures the short-distance context dependency within an N-word window, while the MI model uses mutual information to capture the long-distance context dependency between word pairs beyond the N-word window. MI-Ngram modeling is found to perform much better than n-gram modeling. Evaluation on the XINHUA news corpus of 29 million words shows that including the best 1,600,000 word pairs gives the MI-Trigram model a perplexity 20 percent lower than that of the trigram model. Evaluation on Chinese word segmentation further shows that the MI-Trigram model corrects about 35 percent of the errors made by the trigram model.
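To make the MI component concrete, the following is a minimal sketch of scoring word pairs that fall outside an N-word window and keeping the best-scoring ones. It rests on several assumptions not stated in the abstract: pointwise mutual information stands in for the paper's mutual information measure, and the function names `long_distance_pmi` and `top_pairs` and the `max_dist` cutoff are hypothetical. How the selected pairs are combined with the n-gram probabilities is not described here, so that step is left out.

```python
import math
from collections import Counter


def long_distance_pmi(sentences, n=3, max_dist=10):
    """Score ordered word pairs (a, b) where b follows a at distance >= n.

    Assumption: pointwise mutual information is used as the score; the
    paper's exact measure and selection criterion are not given here.
    """
    word_counts = Counter()
    pair_counts = Counter()
    total_words = 0
    total_pairs = 0
    for sent in sentences:
        word_counts.update(sent)
        total_words += len(sent)
        for i, a in enumerate(sent):
            # Distances 1..n-1 are already covered by the n-gram window,
            # so only pairs at distance n..max_dist are counted.
            last = min(i + max_dist, len(sent) - 1)
            for j in range(i + n, last + 1):
                pair_counts[(a, sent[j])] += 1
                total_pairs += 1
    pmi = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi


def top_pairs(pmi, k):
    """Return the k pairs with the highest score (hypothetical selection rule)."""
    return sorted(pmi, key=pmi.get, reverse=True)[:k]


# Toy segmented corpus: each sentence is a list of word tokens.
corpus = [["a", "b", "c", "d"], ["a", "x", "y", "d"]]
scores = long_distance_pmi(corpus, n=3)
best = top_pairs(scores, 1)
```

With `n=3`, only pairs separated by at least three positions are scored, which mirrors the abstract's split between dependency inside and beyond the trigram window. A real system would need smoothing and a frequency threshold before ranking, since raw pointwise mutual information favors rare pairs.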