An iterative algorithm to build Chinese language models

Authors:
Xiaoqiang Luo; Salim Roukos
Affiliations:
The Johns Hopkins University, Baltimore, MD; IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Year:
1996

Abstract

We present an iterative procedure to build a Chinese language model (LM). We segment Chinese text into words using a word-based Chinese language model. However, constructing a Chinese LM itself requires word boundaries. To escape this chicken-and-egg problem, we propose an iterative procedure that alternates two operations: segmenting text into words and building an LM. Starting with an initial segmented corpus and an LM based upon it, we use a Viterbi-like algorithm to segment another set of data. Then, we build an LM based on the second set and use the resulting LM to re-segment the first corpus. The alternating procedure provides a self-organized way for the segmenter to automatically detect unseen words and correct segmentation errors. Our preliminary experiment shows that the alternating procedure not only improves the accuracy of our segmentation, but also discovers unseen words surprisingly well. The resulting word-based LM has a perplexity of 188 on a general Chinese corpus.
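
The alternating procedure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the LM is a plain unigram model, unseen single characters receive a fixed fallback log-probability, candidate words are capped at four characters, and the names `build_lm`, `segment`, and `alternate` are hypothetical. In particular, this simplified version only reuses words already in the LM, so it does not reproduce the unseen-word discovery reported in the abstract.

```python
import math
from collections import Counter

MAX_WORD_LEN = 4  # assumed cap on candidate word length (not from the paper)

def build_lm(segmented):
    """Unigram LM (an assumption): relative word frequencies in a segmented corpus."""
    counts = Counter(word for sentence in segmented for word in sentence)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def segment(text, lm, unk_logp=-20.0):
    """Viterbi-style dynamic program: best[i] is the highest log-probability
    of any segmentation of text[:i]; back[i] is where its last word starts."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in lm:
                logp = math.log(lm[word])
            elif len(word) == 1:
                logp = unk_logp  # assumed fallback for an unseen single character
            else:
                continue  # unseen multi-character strings are not proposed here
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

def alternate(seed_segmented, raw_a, raw_b, iterations=3):
    """Alternate between segmenting one corpus and rebuilding the LM from it."""
    lm = build_lm(seed_segmented)          # LM from the initial segmented corpus
    seg_a, seg_b = seed_segmented, []
    for _ in range(iterations):
        seg_b = [segment(s, lm) for s in raw_b]  # segment the second set
        lm = build_lm(seg_b)                     # LM from the second set
        seg_a = [segment(s, lm) for s in raw_a]  # re-segment the first corpus
        lm = build_lm(seg_a)
    return lm, seg_a, seg_b
```

A call such as `alternate(seed, raw_a, raw_b)` takes a hand-segmented seed corpus (a list of word lists) and two lists of unsegmented strings. A higher-order n-gram model with smoothing could replace `build_lm` without changing the structure of the loop.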