The estimation of powerful language models from small and large corpora

  • Authors:
  • Paul Placeway; Richard Schwartz; Pascale Fung; Long Nguyen

  • Affiliations:
  • Bolt Beranek and Newman Inc., Cambridge, MA; Bolt Beranek and Newman Inc., Cambridge, MA; Computer Science Department, Columbia University, New York, NY and Bolt Beranek and Newman Inc., Cambridge, MA; Bolt Beranek and Newman Inc., Cambridge, MA

  • Venue:
  • ICASSP '93: Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing: Speech Processing - Volume II
  • Year:
  • 1993

Abstract

This paper deals with the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. We begin with an improved modeling of the grammar statistics, based on a combination of the backing-off technique [6] and zero-frequency techniques [2, 9], which we extend to make them more amenable to our particular system. The resulting technique is greatly simplified, more robust, and gives better recognition performance than either of the previous techniques. We then further attack the problem of the robustness of a model estimated from a small training corpus by grouping words into obvious semantic classes, which significantly improves the robustness of the resulting statistical grammar. We also present a technique that allows the estimation of a high-order model with modest computational resources, enabling us to run a 4-gram statistical model of a 50-million-word corpus on a workstation of only modest capability and cost. Finally, we discuss results from applying a 2-gram statistical language model integrated into the HMM search, obtaining a list of the N-best recognition results, and rescoring this list with a higher-order statistical model.
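The smoothing idea summarized above (backing off to a lower-order distribution, with a zero-frequency estimate reserving probability mass for unseen events) can be illustrated with the following minimal sketch. It is not the paper's actual method: it uses a Witten-Bell-style escape probability as the zero-frequency estimate, backs off from a bigram to an unscaled unigram, and the function names (train_bigram, bigram_prob) are hypothetical. A full backed-off model would also renormalize the back-off distribution over unseen words only.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Collect unigram and bigram counts from a token stream."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for h, w in zip(tokens, tokens[1:]):
        bigrams[h][w] += 1
    return unigrams, bigrams

def bigram_prob(w, h, unigrams, bigrams, total):
    """Estimate P(w | h), backing off to the unigram model for unseen bigrams.

    The escape mass (probability reserved for words never seen after h) is
    estimated Witten-Bell style from the number of distinct successors of h,
    standing in for the zero-frequency estimate mentioned in the abstract.
    """
    followers = bigrams.get(h, Counter())
    n_h = sum(followers.values())          # tokens observed after h
    r_h = len(followers)                   # distinct word types after h
    if n_h == 0:
        return unigrams[w] / total         # no context data: pure unigram
    escape = r_h / (n_h + r_h)             # mass reserved for unseen events
    if w in followers:
        return (1.0 - escape) * followers[w] / n_h
    # back off: scale the unigram estimate into the reserved escape mass
    # (a full implementation would renormalize over unseen words only)
    return escape * unigrams[w] / total

tokens = "the cat sat on the mat the cat ran".split()
unigrams, bigrams = train_bigram(tokens)
total = sum(unigrams.values())
print(bigram_prob("sat", "cat", unigrams, bigrams, total))   # seen bigram
print(bigram_prob("ran", "mat", unigrams, bigrams, total))   # backed-off estimate
```

The same pattern extends to higher orders by backing off from a 4-gram through trigram, bigram, and unigram levels; the abstract's N-best rescoring step would then apply such a higher-order model to re-rank hypotheses produced by the 2-gram model used inside the HMM search.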