Multilingual stochastic n-gram class language models

  • Authors:
  • M. Jardino

  • Affiliations:
  • Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), CNRS, Orsay, France

  • Venue:
  • ICASSP '96: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
  • Year:
  • 1996


Abstract

Stochastic language models are widely used in continuous speech recognition systems, where a priori probabilities of word sequences are needed. These probabilities are usually given by n-gram word models estimated on very large training texts. As n increases, it becomes harder to find reliable statistics, even with huge texts. Grouping words into classes is one way to overcome this problem. We have developed an automatic, language-independent classification procedure that can optimize the classification of tens of millions of untagged words in a few hours on a Unix workstation. With this language-independent approach, three corpora of newspaper text, each containing about 30 million words, in French, German and English, have been mapped into different numbers of classes. From these classifications, bi-gram and tri-gram class language models have been built. Perplexities measured on held-out test texts show that tri-gram class models give lower values than tri-gram word models for all three languages.
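The abstract does not spell out how a class n-gram assigns probabilities, so the following is a minimal sketch of the standard class-based bi-gram decomposition, P(w_i | w_{i-1}) ≈ P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)), estimated by maximum likelihood. The function and variable names are hypothetical, and the paper's actual classification and smoothing procedures are not reproduced here.

```python
from collections import Counter

def train_class_bigram(tokens, word2class):
    """Estimate a class-based bi-gram model from a tokenized corpus.

    Sketch only: P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i),
    with unsmoothed maximum-likelihood counts.
    """
    pair_counts = Counter()     # class-to-class transition counts
    context_counts = Counter()  # how often each class appears as left context
    class_counts = Counter()    # class unigram counts (emission denominator)
    word_counts = Counter(tokens)
    for w in tokens:
        class_counts[word2class[w]] += 1
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[(word2class[prev], word2class[cur])] += 1
        context_counts[word2class[prev]] += 1

    def prob(prev, cur):
        # Transition between classes times emission of the word within its class.
        c_prev, c_cur = word2class[prev], word2class[cur]
        p_trans = pair_counts[(c_prev, c_cur)] / context_counts[c_prev]
        p_emit = word_counts[cur] / class_counts[c_cur]
        return p_trans * p_emit

    return prob
```

Because the transition table is over classes rather than words, its size grows with the (much smaller) number of classes, which is why reliable statistics can be estimated where word tri-grams become sparse.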