A stochastic finite-state word-segmentation algorithm for Chinese
Computational Linguistics
The LIMSI Broadcast News transcription system
Speech Communication - Special issue on automatic transcription of broadcast news data
Improving Chinese tokenization with linguistic filters on statistical lexical acquisition
ANLC '94 Proceedings of the fourth conference on Applied natural language processing
Training neural network language models on very large corpora
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Modeling characters versus words for mandarin speech recognition
ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Knowledge integration into language models: a random forest approach
In this work, the random forest language modeling approach is applied with the aim of improving the performance of LIMSI's highly competitive Mandarin Chinese speech-to-text system. The experimental setup is that of the GALE Phase 4 evaluation, which is characterized by a large amount of available language model training data (over 3.2 billion segmented words). A conventional unpruned 4-gram language model with a 56K-word vocabulary serves as a baseline that is challenging to improve upon. Nevertheless, moderate perplexity and character error rate (CER) improvements over this baseline were obtained with a random forest language model. Different random forest training strategies were explored so as to attain the maximal gain in performance, and a Forest of Random Forest language modeling scheme is introduced.
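The core idea behind random forest language models is to grow several independently randomized decision trees that cluster n-gram histories, estimate a smoothed next-word distribution at each leaf, and average the trees' probabilities. The sketch below is a toy illustration of that averaging scheme under simplifying assumptions (bigram histories, random binary splits, add-one smoothing); the class and method names are invented for this example and do not come from the paper.

```python
import random
from collections import defaultdict

class RandomHistoryTreeLM:
    """One randomized decision tree over bigram histories (toy model).

    The tree recursively partitions the set of observed history words
    at random, then estimates an add-one-smoothed next-word
    distribution at each leaf. Real random forest LMs grow trees with
    data-driven (but randomized) questions; this sketch keeps only the
    structure of the idea.
    """

    def __init__(self, bigrams, vocab, depth=2, rng=None):
        self.vocab = vocab
        self.rng = rng or random.Random(0)
        histories = sorted({h for h, _ in bigrams})
        self.leaf_of = {}               # history word -> leaf id
        self._split(histories, depth, leaf_id=[0])
        # Accumulate next-word counts per leaf.
        self.counts = defaultdict(lambda: defaultdict(int))
        for h, w in bigrams:
            self.counts[self.leaf_of[h]][w] += 1

    def _split(self, histories, depth, leaf_id):
        if depth == 0 or len(histories) <= 1:
            lid = leaf_id[0]
            leaf_id[0] += 1
            for h in histories:
                self.leaf_of[h] = lid
            return
        # Random binary question: membership in a random half of the set.
        subset = set(self.rng.sample(histories, len(histories) // 2))
        self._split([h for h in histories if h in subset], depth - 1, leaf_id)
        self._split([h for h in histories if h not in subset], depth - 1, leaf_id)

    def prob(self, history, word):
        leaf = self.leaf_of.get(history)
        c = self.counts[leaf] if leaf is not None else {}
        total = sum(c.values())
        # Add-one smoothing keeps every word's probability positive.
        return (c.get(word, 0) + 1) / (total + len(self.vocab))

class RandomForestLM:
    """Average the probabilities of several independently randomized trees."""

    def __init__(self, corpus, n_trees=10, seed=1):
        words = corpus.split()
        bigrams = list(zip(words, words[1:]))
        self.vocab = sorted(set(words))
        rng = random.Random(seed)
        self.trees = [RandomHistoryTreeLM(bigrams, self.vocab,
                                          rng=random.Random(rng.random()))
                      for _ in range(n_trees)]

    def prob(self, history, word):
        return sum(t.prob(history, word) for t in self.trees) / len(self.trees)

lm = RandomForestLM("the cat sat on the mat the cat ate the fish")
p = {w: lm.prob("the", w) for w in lm.vocab}
assert abs(sum(p.values()) - 1.0) < 1e-9   # each tree, and hence the forest, is normalized
```

Because each tree partitions histories differently, the averaged distribution is smoother than any single tree's, which is the source of the perplexity gains the forest approach aims for.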