Modeling characters versus words for Mandarin speech recognition

  • Authors:
  • Jun Luo;Lori Lamel;Jean-Luc Gauvain

  • Affiliations:
  • Spoken Language Processing Group, CNRS-LIMSI, BP 133, 91403 Orsay cedex, France (all authors)

  • Venue:
  • ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
  • Year:
  • 2009


Abstract

Word-based models are widely used in speech recognition since they typically perform well. However, the question of whether it is better to use a word-based or a character-based model warrants being asked for Mandarin Chinese. Since Chinese is written without spaces or other word delimiters, a word segmentation algorithm is applied in a pre-processing step prior to training a word-based language model. Chinese characters carry meaning, and speakers are free to combine characters to construct new words, which suggests that character information can also be useful in communication. This paper explores both word-based and character-based models, as well as their complementarity. Although word-based modeling is found to outperform character-based modeling, increasing the vocabulary size from 56k to 160k words did not lead to a gain in performance. Results are reported on the GALE Mandarin speech-to-text task.
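
The contrast between the two unit inventories can be illustrated with a minimal sketch. The snippet below is not the authors' system: it uses the open-source jieba segmenter purely as a stand-in for the word segmentation pre-processing step mentioned in the abstract, and an arbitrary example sentence.

```python
# Illustrative sketch: character-based vs. word-based tokenization of a
# Mandarin sentence, as would be done before training a language model.
# jieba is used here only as an example segmenter, not the one in the paper.
import jieba

sentence = "我们研究普通话语音识别"  # "We study Mandarin speech recognition"

# Character-based units: every Hanzi character is its own token.
char_tokens = list(sentence)

# Word-based units: a segmenter inserts word boundaries prior to LM training.
word_tokens = list(jieba.cut(sentence))

print("characters:", " ".join(char_tokens))
print("words:     ", " ".join(word_tokens))
```

With character units the vocabulary is small and closed, whereas word units depend on the segmenter's output and vocabulary choices, which is why the paper compares the two and examines their complementarity.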