Large-scale language modeling with random forests for Mandarin Chinese speech-to-text

  • Authors:
  • Ilya Oparin; Lori Lamel; Jean-Luc Gauvain

  • Affiliations:
  • LIMSI, CNRS, Orsay Cedex, France (all authors)

  • Venue:
  • IceTAL'10: Proceedings of the 7th International Conference on Advances in Natural Language Processing
  • Year:
  • 2010

Abstract

In this work, the random forest language modeling approach is applied with the aim of improving the performance of the highly competitive LIMSI Mandarin Chinese speech-to-text system. The experimental setup is that of the GALE Phase 4 evaluation, which is characterized by a large amount of available language model training data (over 3.2 billion segmented words). A conventional unpruned 4-gram language model with a 56K-word vocabulary serves as a baseline that is challenging to improve upon. Nevertheless, moderate perplexity and character error rate (CER) improvements over this model were obtained with a random forest language model. Different random forest training strategies were explored so as to attain the maximal gain in performance, and a Forest of Random Forests language modeling scheme is introduced.
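The abstract names the approach without spelling out its mechanics. For readers unfamiliar with it, the core idea of random forest language modeling (as formulated by Xu and Jelinek) is to grow several decision-tree language models with randomized history-clustering questions and to average their probability estimates. The sketch below is an illustrative toy, not the LIMSI system or the paper's implementation: the hash-based history clustering, the add-one smoothing, the class and function names, and the miniature corpus are all assumptions made for brevity.

```python
# Toy sketch of random forest language modeling: each "tree" clusters
# n-gram histories into equivalence classes via a randomized choice of
# history positions (a stand-in for randomized tree-growing questions),
# and the forest averages per-tree probabilities. Illustrative only.

import random
from collections import Counter, defaultdict

class RandomHistoryTree:
    """One randomized history-clustering 'tree' (toy decision-tree LM)."""

    def __init__(self, order=4, num_classes=8, seed=0):
        self.order = order                  # 4-gram -> 3-word history
        self.num_classes = num_classes
        self.rng = random.Random(seed)
        # Randomly pick which history positions this tree conditions on,
        # mimicking randomized question selection during tree growing.
        self.positions = self.rng.sample(
            range(order - 1), k=self.rng.randint(1, order - 1))
        self.counts = defaultdict(Counter)  # class id -> word counts
        self.vocab = set()

    def _cls(self, history):
        key = tuple(history[p] for p in self.positions)
        return hash(key) % self.num_classes

    def train(self, sentences):
        for sent in sentences:
            padded = ["<s>"] * (self.order - 1) + sent
            self.vocab.update(sent)
            for i in range(self.order - 1, len(padded)):
                history = padded[i - self.order + 1:i]
                self.counts[self._cls(history)][padded[i]] += 1

    def prob(self, word, history):
        c = self.counts[self._cls(history)]
        # Add-one smoothing over the tree's vocabulary.
        return (c[word] + 1) / (sum(c.values()) + len(self.vocab) + 1)

class RandomForestLM:
    """Forest probability = uniform average of per-tree probabilities."""

    def __init__(self, num_trees=10, order=4):
        self.trees = [RandomHistoryTree(order, seed=s)
                      for s in range(num_trees)]

    def train(self, sentences):
        for tree in self.trees:
            tree.train(sentences)

    def prob(self, word, history):
        return sum(t.prob(word, history)
                   for t in self.trees) / len(self.trees)

# Toy usage with segmented tokens standing in for Mandarin words.
corpus = [["我", "喜欢", "学习", "语言"], ["我", "喜欢", "语言", "模型"]]
lm = RandomForestLM(num_trees=4)
lm.train(corpus)
print(lm.prob("语言", ["我", "喜欢", "学习"]))
```

The intuition behind averaging, which this sketch tries to make concrete, is that each randomized tree makes different hard clustering decisions; averaging over differently randomized trees smooths out any single tree's mistakes, which is the usual source of the perplexity gains over a single n-gram or single-tree model.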