Multilingual stochastic n-gram class language models

  • Authors:
  • M. Jardino

  • Affiliations:
  • Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), CNRS, Orsay, France

  • Venue:
  • ICASSP '96: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
  • Year:
  • 1996


Abstract

Stochastic language models are widely used in continuous speech recognition systems, where a priori probabilities of word sequences are needed. These probabilities are usually given by n-gram word models estimated on very large training texts. As n increases, it becomes harder to find reliable statistics, even with huge texts. Grouping words into classes is one way to overcome this problem. We have developed an automatic, language-independent classification procedure that can optimize the classification of tens of millions of untagged words in a few hours on a Unix workstation. With this language-independent approach, three corpora of newspaper text, each containing about 30 million words, in French, German and English, have been mapped into different numbers of classes. From these classifications, bi-gram and tri-gram class language models have been built. Perplexities measured on held-out test texts show that tri-gram class models give lower values than tri-gram word models for all three languages.
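The abstract does not spell out how a class n-gram assigns probabilities, so the following is a minimal sketch of the standard class-based bi-gram decomposition, P(w_i | w_{i-1}) ≈ P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)), estimated by maximum likelihood. The function and variable names are hypothetical, and the paper's actual classification and smoothing procedures are not reproduced here.

```python
from collections import Counter

def train_class_bigram(tokens, word2class):
    """Estimate a class-based bi-gram model from a tokenized corpus.

    Sketch only: P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i),
    with unsmoothed maximum-likelihood counts.
    """
    pair_counts = Counter()     # class-to-class transition counts
    context_counts = Counter()  # how often each class appears as left context
    class_counts = Counter()    # class unigram counts (emission denominator)
    word_counts = Counter(tokens)
    for w in tokens:
        class_counts[word2class[w]] += 1
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[(word2class[prev], word2class[cur])] += 1
        context_counts[word2class[prev]] += 1

    def prob(prev, cur):
        # Transition between classes times emission of the word within its class.
        c_prev, c_cur = word2class[prev], word2class[cur]
        p_trans = pair_counts[(c_prev, c_cur)] / context_counts[c_prev]
        p_emit = word_counts[cur] / class_counts[c_cur]
        return p_trans * p_emit

    return prob
```

Because the transition table is over classes rather than words, its size grows with the (much smaller) number of classes, which is why reliable statistics can be estimated where word tri-grams become sparse.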