Subphonetic Acoustic Modeling for Speaker-Independent Continuous Speech Recognition

  • Authors:
  • Mei-Yuh Hwang

  • Year:
  • 2001

Abstract

To model the acoustics of a large vocabulary well while staying within a reasonable memory capacity, most speech recognition systems use phonetic models to share parameters across different words in the vocabulary. This dissertation investigates the merits of modeling at the subphonetic level and demonstrates that sharing parameters at the subphonetic level yields more accurate acoustic models than sharing at the phonetic level. Although the concept of subphonetic parameter sharing can be applied to any class of parametric models, all studies and experiments here are based on the first-order hidden Markov model (HMM), which has been overwhelmingly successful in speech recognition. The subphonetic unit we investigate is the state of phonetic HMMs. We develop a system in which similar Markov states of phonetic models share the same Markov parameters. The shared parameter (i.e., the output distribution) associated with a cluster of similar states is called a "senone" because of its state dependency, and the phonetic models that share senones are called shared-distribution models (SDMs). Experiments show that SDMs offer more accurate acoustic models than the generalized-triphone model presented by Lee.

Senones are next applied to provide accurate models for triphones not observed in the system training data. Two approaches to modeling unseen triphones are studied: purely decision-tree-based senones and a hybrid approach using the concept of Markov state quantization. Both approaches offer a significant error reduction over the previously accepted approach of monophone-model substitution; however, the purely decision-tree-based senone approach is preferred for its simplicity.

The concept of Markov state quantization can also be applied to the automatic determination of a senonic baseform. A "senonic baseform" is a word HMM whose output distributions are replaced by the closest senones, resulting in no increase in the number of parameters in the existing senonic system. Because it is acoustics-driven, the senonic baseform is useful for speaker adaptation and for learning new words.

Finally, we explore relaxing the mixture-tying constraint in semi-continuous HMMs, moving from a system in which the VQ probability densities are tied across all the Markov states to one in which only similar states share the same densities. To reduce computation and make the experiments feasible, phone-class-dependent VQ codebooks are studied.

A large suite of experimental results is presented to demonstrate the relative effectiveness of each component of the thesis. After integrating the senonic decision tree with 8 phone-class-dependent VQ codebooks into SPHINX-II, we attained a word error rate of 6.7% on the speaker-independent 5,000-word Wall Street Journal continuous-speech recognition task.
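
To make the senone idea concrete, below is a minimal illustrative sketch, not the dissertation's actual implementation, of entropy-based decision-tree state clustering: each triphone HMM state is summarized here by a discrete output distribution over VQ codewords plus an occupancy count, phonetic questions about the left and right contexts drive the splits, and each resulting leaf becomes one senone shared by all states in it. The names (`State`, `QUESTIONS`, `grow_tree`), the likelihood criterion, and the tiny question set are all assumptions made for the example.

```python
# Hypothetical sketch of decision-tree senone clustering over discrete
# HMM output distributions. All names and the question set are invented
# for illustration; the splitting criterion is the standard pooled
# log-likelihood gain used in entropy-based state tying.
import numpy as np

class State:
    """One triphone HMM state: (left context, base phone, right context)."""
    def __init__(self, left, phone, right, dist, count):
        self.left, self.phone, self.right = left, phone, right
        self.dist = np.asarray(dist, dtype=float)   # P(codeword | state)
        self.count = count                          # frame occupancy

# Hypothetical phonetic question set: question name -> set of context phones.
QUESTIONS = {"L-vowel": {"aa", "iy"}, "R-vowel": {"aa", "iy"},
             "L-nasal": {"n", "m"},   "R-nasal": {"n", "m"}}

def pooled_log_likelihood(states):
    """Log-likelihood of the count-weighted pooled distribution of a cluster."""
    counts = sum(s.count * s.dist for s in states)  # expected codeword counts
    total = counts.sum()
    p = counts / total
    return total * np.sum(p[p > 0] * np.log(p[p > 0]))

def split(states, qname):
    """Partition states by asking a phonetic question of one context."""
    members = QUESTIONS[qname]
    ctx = (lambda s: s.left) if qname.startswith("L-") else (lambda s: s.right)
    yes = [s for s in states if ctx(s) in members]
    no = [s for s in states if ctx(s) not in members]
    return yes, no

def grow_tree(states, min_gain=1.0):
    """Recursively split on the best question; each leaf is one senone."""
    base = pooled_log_likelihood(states)
    best = None
    for q in QUESTIONS:
        yes, no = split(states, q)
        if not yes or not no:
            continue  # question does not divide this cluster
        gain = pooled_log_likelihood(yes) + pooled_log_likelihood(no) - base
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is None or best[0] < min_gain:
        return [states]                              # leaf: one senone
    _, _, yes, no = best
    return grow_tree(yes, min_gain) + grow_tree(no, min_gain)

# Toy usage: three states of the same base phone, three VQ codewords.
states = [State("aa", "k", "n", [0.7, 0.2, 0.1], 50),
          State("iy", "k", "m", [0.6, 0.3, 0.1], 40),
          State("t",  "k", "s", [0.1, 0.2, 0.7], 30)]
for i, leaf in enumerate(grow_tree(states)):
    print("senone", i, [f"{s.left}-{s.phone}+{s.right}" for s in leaf])
```

An unseen triphone can then be handled by answering the tree's questions about its contexts and adopting the senone at the leaf it reaches, which is the appeal of the purely decision-tree-based approach.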
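
In the same spirit, here is a hypothetical sketch of the senonic-baseform idea under the same discrete-distribution assumption: each output distribution of a word HMM is quantized to its nearest existing senone, so the word model reuses senones rather than adding parameters. Kullback-Leibler divergence is used as one plausible closeness measure; the function names are invented for the example.

```python
# Hypothetical sketch of deriving a senonic baseform: quantize each
# word-HMM state distribution to the index of the closest senone.
import numpy as np

def kl(p, q, eps=1e-10):
    """KL divergence between two discrete distributions (smoothed)."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def senonic_baseform(word_state_dists, senone_dists):
    """Return, per word-HMM state, the index of the closest senone."""
    return [min(range(len(senone_dists)),
                key=lambda j: kl(d, senone_dists[j]))
            for d in word_state_dists]

# Toy usage: three senones, a two-state word model.
senones = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
word = [[0.7, 0.2, 0.1], [0.2, 0.1, 0.7]]
print(senonic_baseform(word, senones))   # -> [0, 2]
```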