This paper deals with the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. We begin with improved modeling of the grammar statistics, based on a combination of the backing-off technique [6] and zero-frequency techniques [2, 9]. These are extended to be more amenable to our particular system. The resulting technique is greatly simplified, more robust, and yields better recognition performance than either of the previous techniques. We then further attack the problem of robustness for models estimated from small training corpora by grouping words into obvious semantic classes, which significantly improves the robustness of the resulting statistical grammar. We also present a technique that allows the estimation of a high-order model on modest computational resources, enabling us to run a 4-gram statistical model estimated from a 50-million-word corpus on a workstation of only modest capability and cost. Finally, we discuss results from applying a 2-gram statistical language model integrated into the HMM search, obtaining a list of the N-best recognition results, and rescoring this list with a higher-order statistical model.
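To make the combination of backing-off and zero-frequency estimation concrete, the sketch below shows one simple way such an estimator can be built for a bigram model. It uses a Witten-Bell-style zero-frequency estimate (probability mass for unseen successors proportional to the number of distinct words observed after the history word) interpolated with a smoothed unigram fallback. The function names and the add-one unigram smoothing are illustrative simplifications for this sketch, not the exact formulation used in the paper.

```python
from collections import defaultdict

def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a token stream."""
    unigram = defaultdict(int)
    bigram = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(tokens, tokens[1:]):
        unigram[w1] += 1
        bigram[w1][w2] += 1
    if tokens:
        unigram[tokens[-1]] += 1
    return unigram, bigram

def bigram_prob(w1, w2, unigram, bigram, vocab_size):
    """Backed-off bigram probability with a Witten-Bell-style
    zero-frequency estimate: mass proportional to the number of
    distinct successors of w1 is reserved for unseen bigrams and
    redistributed through a smoothed unigram distribution."""
    total = sum(unigram.values())
    p_uni = (unigram.get(w2, 0) + 1) / (total + vocab_size)  # add-one smoothed unigram
    followers = bigram.get(w1)
    if not followers:
        return p_uni                       # history never observed: back off fully
    n = sum(followers.values())            # tokens observed after w1
    t = len(followers)                     # distinct word types observed after w1
    lam = n / (n + t)                      # weight on the observed bigram estimate
    p_big = followers.get(w2, 0) / n
    return lam * p_big + (1.0 - lam) * p_uni

# Illustrative usage on a toy corpus
tokens = "the cat sat on the mat and the cat slept".split()
uni, bi = train_bigram_counts(tokens)
vocab = len(uni)
print(bigram_prob("the", "cat", uni, bi, vocab))   # seen bigram
print(bigram_prob("the", "dog", uni, bi, vocab))   # unseen bigram, falls back toward unigram
```

A higher-order model of the kind used for N-best rescoring would chain the same idea recursively, backing off from 4-grams to 3-grams, bigrams, and finally unigrams; the class-based grammar described above would replace the raw word counts with counts over semantic classes.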