A stochastic finite-state word-segmentation algorithm for Chinese
Computational Linguistics
The LIMSI Broadcast News transcription system
Speech Communication - Special issue on automatic transcription of broadcast news data
Improving Chinese tokenization with linguistic filters on statistical lexical acquisition
ANLC '94 Proceedings of the fourth conference on Applied natural language processing
Training neural network language models on very large corpora
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Modeling characters versus words for mandarin speech recognition
ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Knowledge integration into language models: a random forest approach
In this work, the random forest language modeling approach is applied with the aim of improving the performance of LIMSI's highly competitive Mandarin Chinese speech-to-text system. The experimental setup is that of the GALE Phase 4 evaluation, which is characterized by a large amount of available language model training data (over 3.2 billion segmented words). A conventional unpruned 4-gram language model with a 56K-word vocabulary serves as a baseline that is challenging to improve upon. Nevertheless, moderate perplexity and character error rate (CER) improvements over this baseline were obtained with a random forest language model. Different random forest training strategies were explored so as to attain the maximal gain in performance, and a Forest of Random Forest language modeling scheme is introduced.
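The core idea behind random forest language models is to grow several independently randomized decision trees that cluster n-gram histories, estimate a smoothed next-word distribution at each leaf, and average the trees' probabilities. The sketch below is a toy illustration of that averaging scheme under simplifying assumptions (bigram histories, random binary splits, add-one smoothing); the class and method names are invented for this example and do not come from the paper.

```python
import random
from collections import defaultdict

class RandomHistoryTreeLM:
    """One randomized decision tree over bigram histories (toy model).

    The tree recursively partitions the set of observed history words
    at random, then estimates an add-one-smoothed next-word
    distribution at each leaf. Real random forest LMs grow trees with
    data-driven (but randomized) questions; this sketch keeps only the
    structure of the idea.
    """

    def __init__(self, bigrams, vocab, depth=2, rng=None):
        self.vocab = vocab
        self.rng = rng or random.Random(0)
        histories = sorted({h for h, _ in bigrams})
        self.leaf_of = {}               # history word -> leaf id
        self._split(histories, depth, leaf_id=[0])
        # Accumulate next-word counts per leaf.
        self.counts = defaultdict(lambda: defaultdict(int))
        for h, w in bigrams:
            self.counts[self.leaf_of[h]][w] += 1

    def _split(self, histories, depth, leaf_id):
        if depth == 0 or len(histories) <= 1:
            lid = leaf_id[0]
            leaf_id[0] += 1
            for h in histories:
                self.leaf_of[h] = lid
            return
        # Random binary question: membership in a random half of the set.
        subset = set(self.rng.sample(histories, len(histories) // 2))
        self._split([h for h in histories if h in subset], depth - 1, leaf_id)
        self._split([h for h in histories if h not in subset], depth - 1, leaf_id)

    def prob(self, history, word):
        leaf = self.leaf_of.get(history)
        c = self.counts[leaf] if leaf is not None else {}
        total = sum(c.values())
        # Add-one smoothing keeps every word's probability positive.
        return (c.get(word, 0) + 1) / (total + len(self.vocab))

class RandomForestLM:
    """Average the probabilities of several independently randomized trees."""

    def __init__(self, corpus, n_trees=10, seed=1):
        words = corpus.split()
        bigrams = list(zip(words, words[1:]))
        self.vocab = sorted(set(words))
        rng = random.Random(seed)
        self.trees = [RandomHistoryTreeLM(bigrams, self.vocab,
                                          rng=random.Random(rng.random()))
                      for _ in range(n_trees)]

    def prob(self, history, word):
        return sum(t.prob(history, word) for t in self.trees) / len(self.trees)

lm = RandomForestLM("the cat sat on the mat the cat ate the fish")
p = {w: lm.prob("the", w) for w in lm.vocab}
assert abs(sum(p.values()) - 1.0) < 1e-9   # each tree, and hence the forest, is normalized
```

Because each tree partitions histories differently, the averaged distribution is smoother than any single tree's, which is the source of the perplexity gains the forest approach aims for.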