Speaker-independent phoneme alignment using transition-dependent states

  • Authors: John-Paul Hosom
  • Affiliations: Center for Spoken Language Understanding, School of Science and Engineering, Oregon Health and Science University, 20000 NW Walker Road, Beaverton, OR 97006, USA
  • Venue: Speech Communication
  • Year: 2009

Abstract

Determining the location of phonemes is important to a number of speech applications, including training of automatic speech recognition systems, building text-to-speech systems, and research on human speech processing. Agreement of humans on the location of phonemes is, on average, 93.78% within 20 ms on a variety of corpora, and 93.49% within 20 ms on the TIMIT corpus. We describe a baseline forced-alignment system and a proposed system with several modifications to this baseline. Modifications include the addition of energy-based features to the standard cepstral feature set, the use of probabilities of a state transition given an observation, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities. Performance of the baseline system on the test partition of the TIMIT corpus is 91.48% within 20 ms, and performance of the proposed system on this corpus is 93.36% within 20 ms. The proposed system achieves a 22% relative reduction in error over the baseline system, and a 14% reduction in error over results from a non-HMM alignment system. This result of 93.36% agreement is the best known reported result on the TIMIT corpus.
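The "within 20 ms" agreement figures and the 22% relative error reduction quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the paper's scoring code: the function names, the inclusive tolerance test, and the toy boundary times are assumptions made for illustration. It assumes that both alignments cover the same phoneme sequence, so boundaries can be compared pairwise.

```python
from typing import Sequence


def agreement_within(ref_ms: Sequence[float], hyp_ms: Sequence[float],
                     tol_ms: float = 20.0) -> float:
    """Percentage of boundaries whose placement differs by at most tol_ms.

    Assumes ref_ms and hyp_ms list the same boundaries in the same order.
    Treating a difference of exactly tol_ms as agreement is an assumption.
    """
    if len(ref_ms) != len(hyp_ms):
        raise ValueError("alignments must contain the same number of boundaries")
    if not ref_ms:
        raise ValueError("no boundaries to compare")
    hits = sum(1 for r, h in zip(ref_ms, hyp_ms) if abs(r - h) <= tol_ms)
    return 100.0 * hits / len(ref_ms)


def relative_error_reduction(baseline_acc: float, proposed_acc: float) -> float:
    """Relative reduction in error (in percent), given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    proposed_err = 100.0 - proposed_acc
    return 100.0 * (baseline_err - proposed_err) / baseline_err


# Toy boundary times in milliseconds (made-up values, not TIMIT data).
reference = [100, 250, 400, 620]
hypothesis = [105, 275, 395, 638]
toy_agreement = agreement_within(reference, hypothesis)

# Accuracies taken from the abstract: baseline 91.48%, proposed 93.36%.
reduction = relative_error_reduction(91.48, 93.36)
print(f"toy agreement: {toy_agreement:.1f}%")
print(f"relative error reduction: {reduction:.1f}%")
```

With the accuracies from the abstract, the error falls from 8.52% to 6.64%, which is a reduction of roughly 22% relative to the baseline error, matching the figure reported above.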