We present an iterative procedure for building a Chinese language model (LM). We segment Chinese text into words using a word-based Chinese language model. However, constructing a Chinese LM itself requires word boundaries. To escape this chicken-and-egg problem, we propose an iterative procedure that alternates between two operations: segmenting text into words and building an LM. Starting with an initial segmented corpus and an LM built from it, we use a Viterbi-like algorithm to segment a second set of data. We then build an LM from the second set and use the resulting LM to re-segment the first corpus. The alternating procedure gives the segmenter a self-organized way to automatically detect unseen words and correct segmentation errors. Our preliminary experiment shows that the alternating procedure not only improves the accuracy of our segmentation but also discovers unseen words surprisingly well. The resulting word-based LM has a perplexity of 188 on a general Chinese corpus.
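The alternation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a plain unigram word model, a fixed maximum word length (`MAX_WORD_LEN`), and a constant penalty (`unk_logp`) for single characters missing from the model, and all function names are hypothetical. Under these simplifications it can only fall back to single characters for unknown material, so it does not reproduce the unseen-word discovery reported in the abstract.

```python
import math
from collections import Counter

MAX_WORD_LEN = 4  # assumed cap on word length, in characters


def build_unigram_lm(segmented_sentences):
    """Estimate unigram word log-probabilities from a segmented corpus."""
    counts = Counter(w for sent in segmented_sentences for w in sent)
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}


def viterbi_segment(text, lm, unk_logp=-20.0):
    """Return the highest-scoring segmentation of `text` under the unigram LM.

    Single characters missing from the LM get the assumed penalty `unk_logp`;
    longer strings missing from the LM are not considered as words.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in lm:
                score = lm[word]
            elif len(word) == 1:
                score = unk_logp
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


def alternate(seg_a, raw_b, iterations=2):
    """Alternate between building an LM and segmenting, as in the abstract.

    `seg_a` is the initial segmented corpus; `raw_b` is unsegmented text.
    """
    raw_a = ["".join(sent) for sent in seg_a]
    seg_b = []
    for _ in range(iterations):
        lm = build_unigram_lm(seg_a)                      # LM from first set
        seg_b = [viterbi_segment(t, lm) for t in raw_b]   # segment second set
        lm = build_unigram_lm(seg_b)                      # LM from second set
        seg_a = [viterbi_segment(t, lm) for t in raw_a]   # re-segment first set
    return seg_a, seg_b


if __name__ == "__main__":
    seed = [["我", "喜欢", "北京"], ["北京", "很", "大"]]
    new_a, new_b = alternate(seed, ["我喜欢北京"])
    print(new_a, new_b)
```

A higher-order n-gram model and a mechanism for proposing new word candidates would be needed to approach the behavior the abstract describes; the sketch only shows the control flow of the two alternating steps.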