Techniques to achieve an accurate real-time large-vocabulary speech recognition system

  • Authors:
  • Hy Murveit; Peter Monaco; Vassilios Digalakis; John Butzberger

  • Affiliations:
  • SRI International, Menlo Park, California; SRI International, Menlo Park, California; SRI International, Menlo Park, California; SRI International, Menlo Park, California

  • Venue:
  • HLT '94 Proceedings of the workshop on Human Language Technology
  • Year:
  • 1994

Abstract

In addressing the problem of achieving high-accuracy real-time speech recognition systems, we focus on recognizing speech from ARPA's 20,000-word Wall Street Journal (WSJ) task, using current UNIX workstations. We have found that our standard approach, which uses a narrow beam width in a Viterbi search for simple discrete-density hidden Markov models (HMMs), works in real time but with only very low accuracy. Our most accurate algorithms recognize speech many times slower than real time. Our (yet unattained) goal is to recognize speech in real time at or near full accuracy.

We describe the speed/accuracy trade-offs associated with several techniques used in a one-pass speech recognition framework:

  • Trade-offs associated with reducing the acoustic modeling resolution of the HMMs (e.g., output-distribution type, number of parameters, cross-word modeling)
  • Trade-offs associated with using lexicon trees, and techniques for implementing full and partial bigram grammars with those trees
  • Computation of Gaussian probabilities, which is the most time-consuming aspect of our highest-accuracy system, and techniques that allow us to reduce the number of Gaussian probabilities computed with little or no impact on speech recognition accuracy

Our results show that tree-based modeling techniques used with appropriate acoustic modeling approaches achieve real-time performance on current UNIX workstations at about a 30% error rate for the WSJ task. The results also show that we can dramatically reduce the computational complexity of our more accurate but slower modeling alternatives so that they are near the speed necessary for real-time performance in a multipass search. Our near-future goal is to combine these two technologies so that real-time, high-accuracy large-vocabulary speech recognition can be achieved.
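
To make the lexicon-tree idea mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a lexicon tree is a prefix tree over phone sequences, so that words sharing initial phones share the corresponding nodes. The names `TreeNode`, `build_lexicon_tree`, `count_nodes`, and the toy pronunciations are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TreeNode:
    """One phone node in a toy lexicon tree (hypothetical structure)."""
    children: Dict[str, "TreeNode"] = field(default_factory=dict)
    # Set only at the node where a pronunciation ends. This toy version
    # keeps a single word per node, so homophones would overwrite each other.
    word: Optional[str] = None


def build_lexicon_tree(lexicon: Dict[str, List[str]]) -> TreeNode:
    """Insert every pronunciation into a prefix tree keyed by phone."""
    root = TreeNode()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            # Reuse the existing child when the prefix is already present.
            node = node.children.setdefault(phone, TreeNode())
        node.word = word
    return root


def count_nodes(node: TreeNode) -> int:
    """Count phone nodes below (not including) the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Toy pronunciations, invented for illustration only.
TOY_LEXICON = {
    "stock": ["s", "t", "aa", "k"],
    "stop": ["s", "t", "aa", "p"],
    "star": ["s", "t", "aa", "r"],
    "street": ["s", "t", "r", "iy", "t"],
}

# A flat lexicon needs one phone model instance per phone of every word.
flat_phone_count = sum(len(phones) for phones in TOY_LEXICON.values())

# The tree shares common prefixes, so fewer nodes need to be evaluated.
tree = build_lexicon_tree(TOY_LEXICON)
tree_phone_count = count_nodes(tree)

print(f"flat lexicon: {flat_phone_count} phone nodes")
print(f"lexicon tree: {tree_phone_count} phone nodes")
```

In this sketch the word identity is known only at the node where a pronunciation ends, which is one commonly cited reason that applying bigram probabilities in a tree-structured search needs special handling. The node counts here illustrate the sharing effect on a toy vocabulary and say nothing about the savings on the 20,000-word WSJ task.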