Large-scale language modeling with random forests for Mandarin Chinese speech-to-text
IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
Language model cross adaptation for LVCSR system combination
Computer Speech and Language
Class-based language models for Chinese-English parallel corpus
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume 2
Word-based models are widely used in speech recognition since they typically perform well. However, the question of whether a word-based or a character-based model is preferable warrants investigation for Mandarin Chinese. Since Chinese is written without spaces or other word delimiters, a word segmentation algorithm is applied in a pre-processing step prior to training a word-based language model. Chinese characters carry meaning, and speakers are free to combine characters to construct new words, which suggests that character information can also be useful in communication. This paper explores both word-based and character-based models, and their complementarity. Although word-based modeling is found to outperform character-based modeling, increasing the vocabulary size from 56k to 160k words did not yield a performance gain. Results are reported for the GALE Mandarin speech-to-text task.
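The contrast between the two modeling units can be sketched in a few lines. The following is an illustrative example only, not the paper's method: the toy vocabulary and the greedy forward maximum-matching segmenter are assumptions chosen for demonstration, standing in for whatever segmenter a word-based system would actually use.

```python
# Hypothetical toy word list; a real segmenter would use a large dictionary.
TOY_VOCAB = {"中国", "人民"}

def char_tokenize(text):
    """Character-based unit: every Chinese character is its own token."""
    return list(text)

def word_segment(text, vocab=TOY_VOCAB, max_len=4):
    """Word-based unit via greedy forward maximum matching (an assumed,
    simplified segmentation algorithm): at each position, take the longest
    dictionary word; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in vocab:
                tokens.append(cand)
                i += length
                break
    return tokens

print(char_tokenize("中国人民"))  # ['中', '国', '人', '民']
print(word_segment("中国人民"))   # ['中国', '人民']
```

The character-based model sidesteps the segmentation step entirely, at the cost of shorter context per token; the word-based model depends on the quality and coverage of its vocabulary, which is why the abstract's 56k-to-160k vocabulary comparison is of interest.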