A new word-clustering technique is proposed to efficiently build statistically salient class 2-grams from language corpora. By splitting each word's neighboring characteristics into word-preceding and word-following directions, multiple (two-dimensional) word classes are assigned to each word. On each side, word classes are merged into larger clusters independently, according to the distributions of preceding or following words. This word clustering provides more efficient and statistically reliable word clusters. We further extend it to a multi-class composite N-gram whose units are multi-class 2-grams and joined words. The multi-class composite N-gram showed better performance in both perplexity and recognition rate, at one-thousandth the size of conventional word 2-grams.
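The two-directional clustering above can be illustrated with a minimal sketch. This is not the authors' implementation: the class assignments are hypothetical toy values, and the model simply factors the bigram probability as P(C_follow(w2) | C_precede(w1)) x P(w2 | C_follow(w2)), using a separate class for a word when it acts as history (preceding side) and when it is predicted (following side).

```python
from collections import defaultdict

# Hypothetical two-sided class assignments: each word has one class for when
# it precedes (acts as history) and one for when it follows (is predicted).
precede_class = {"the": 0, "a": 0, "dog": 1, "cat": 1}
follow_class = {"the": 0, "a": 0, "dog": 1, "cat": 1}

def train(corpus):
    """Count class-transition and word-emission statistics from token lists."""
    cc = defaultdict(int)    # (C_precede(w1), C_follow(w2)) transition counts
    c1 = defaultdict(int)    # C_precede(w1) history counts
    emit = defaultdict(int)  # (C_follow(w2), w2) emission counts
    cf = defaultdict(int)    # C_follow(w2) class counts
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            cc[(precede_class[w1], follow_class[w2])] += 1
            c1[precede_class[w1]] += 1
            emit[(follow_class[w2], w2)] += 1
            cf[follow_class[w2]] += 1
    return cc, c1, emit, cf

def prob(w1, w2, model):
    """P(w2 | w1) ~= P(C_follow(w2) | C_precede(w1)) * P(w2 | C_follow(w2))."""
    cc, c1, emit, cf = model
    p_class = cc[(precede_class[w1], follow_class[w2])] / c1[precede_class[w1]]
    p_word = emit[(follow_class[w2], w2)] / cf[follow_class[w2]]
    return p_class * p_word

corpus = [["the", "dog"], ["a", "cat"], ["the", "cat"]]
model = train(corpus)
p = prob("the", "dog", model)  # (3/3) * (1/3) = 1/3 on this toy corpus
```

Because probabilities are stored per class pair rather than per word pair, the model needs far fewer parameters than a word 2-gram, which is the source of the size reduction the abstract reports.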