A model-based connected-digit recognition system using either hidden Markov models or templates

  • Authors:
  • L. R. Rabiner; J. G. Wilpon; B. H. Juang

  • Venue:
  • Computer Speech and Language
  • Year:
  • 1986

Abstract

Although a great deal of effort has gone into studying large-vocabulary speech-recognition problems, there remain a number of interesting, and potentially very important, problems that do not require the complexity of these large systems. One such problem is connected-digit recognition, which has applications to telecommunications, order entry, credit-card entry, forms automation, and database management, among others. Connected-digit recognition is also interesting for another reason: it is a problem in which whole-word training patterns are applicable as the basic speech-recognition unit. Thus one can bring to bear all the fundamental speech-recognition technology associated with whole-word recognition to solve this problem. Accordingly, several connected-digit recognizers have been proposed in the past few years. The performance of these systems has steadily improved to the point where high digit-recognition accuracy is achievable in a speaker-trained mode. In this paper we present a unified system for automatically recognizing fluently spoken digit strings based on whole-word reference units. The system that we describe can use either hidden Markov model (HMM) technology or template-based technology. In fact, the overall system contains features from both approaches. A key factor in the success of the various connected-digit recognizers is the ability to derive, via a training procedure, a good set of representations of the behavior of the individual digits in actual connected-digit strings. For most applications, isolated-digit training does not provide a good enough characterization of the variability of the digits in strings. The "best" training procedure is to derive the digit reference patterns (either templates or statistical models) from connected-digit strings. Such a connected-word training procedure, based on a segmental k-means loop, has been proposed and was tested on seven experienced users of speech recognizers.
For these seven talkers, average string accuracies of greater than 98% for unknown-length strings, and greater than 99% for known-length strings, were obtained on an independent test set of 525 variable-length strings (1-7 digits) recorded over local dialed-up telephone lines. To evaluate the performance of the overall connected-digit recognizer under more difficult conditions, 50 people (25 men, 25 women) from the non-technical local population were each asked to record 1200 random digit strings over local dialed-up telephone lines. Both speaker-trained and multi-speaker training sets were created, and a full performance evaluation was made. Results show that the average string accuracy for unknown- and known-length strings in the speaker-trained mode was 98% and 99%, respectively; in the multi-speaker mode the average string accuracies were 94% and 96.6%, respectively. A complete analysis of these results is given in this paper.
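
The abstract names a segmental k-means training loop but does not spell it out. The following is a minimal sketch of the general idea only, not the authors' procedure: it alternates between (1) segmenting each labeled digit string into one contiguous segment per digit using the current models and (2) re-estimating each digit model from the frames assigned to it. Everything concrete here is an assumption made for illustration: frames are single scalar features, each digit "model" is just a mean value, squared distance stands in for an HMM likelihood or a template distance, and the names `segment`, `segmental_kmeans`, and `toy_strings` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def segment(frames: Sequence[float], means: Sequence[float]) -> List[Tuple[int, int]]:
    """Split frames into len(means) contiguous, non-empty segments.

    Dynamic programming picks the boundaries that minimize the total
    squared distance between each frame and the mean of its segment.
    Returns one (start, end) pair per segment, end exclusive.
    """
    T, K = len(frames), len(means)
    if T < K:
        raise ValueError("need at least one frame per unit")
    INF = float("inf")
    # cost[k][t]: best cost of covering the first t frames with the first k units
    cost = [[INF] * (T + 1) for _ in range(K + 1)]
    back = [[0] * (T + 1) for _ in range(K + 1)]
    cost[0][0] = 0.0
    for k in range(1, K + 1):
        for t in range(k, T + 1):
            seg_cost = 0.0
            # unit k covers frames s .. t-1; grow the segment leftwards
            for s in range(t - 1, k - 2, -1):
                seg_cost += (frames[s] - means[k - 1]) ** 2
                c = cost[k - 1][s] + seg_cost
                if c < cost[k][t]:
                    cost[k][t] = c
                    back[k][t] = s
    # trace the best boundaries back from the end of the string
    bounds = []
    t = T
    for k in range(K, 0, -1):
        s = back[k][t]
        bounds.append((s, t))
        t = s
    return bounds[::-1]


def segmental_kmeans(
    strings: Sequence[Tuple[Sequence[float], Sequence[str]]],
    init_means: Dict[str, float],
    iterations: int = 10,
) -> Dict[str, float]:
    """Alternate segmentation and re-estimation until the models stop changing."""
    means = dict(init_means)
    for _ in range(iterations):
        pooled: Dict[str, List[float]] = {d: [] for d in means}
        for frames, labels in strings:
            bounds = segment(frames, [means[d] for d in labels])
            for d, (s, t) in zip(labels, bounds):
                pooled[d].extend(frames[s:t])
        # re-estimate each digit model from the frames assigned to it
        new_means = {d: (sum(v) / len(v) if v else means[d]) for d, v in pooled.items()}
        if new_means == means:
            break
        means = new_means
    return means


# Toy data (invented for illustration): each string is (frames, digit labels).
toy_strings = [
    ([1.1, 0.9, 5.2, 4.8, 5.0], ["1", "2"]),
    ([9.1, 8.9, 1.0, 1.2], ["3", "1"]),
    ([5.1, 9.0, 9.2, 8.8], ["2", "3"]),
]
trained = segmental_kmeans(toy_strings, {"1": 0.0, "2": 4.0, "3": 8.0})
```

In this sketch the digit labels of each training string are known and only the boundaries between digits are unknown, which is what lets the reference patterns be learned from connected strings rather than from isolated digits. A real system would replace the scalar mean with an HMM or a set of templates and the squared distance with the corresponding alignment score.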