Speaker-independent phoneme alignment using transition-dependent states

Authors:
John-Paul Hosom
Affiliations:
Center for Spoken Language Understanding, School of Science and Engineering, Oregon Health and Science University, 20000 NW Walker Road, Beaverton, OR 97006, USA
Venue:
Speech Communication
Year:
2009

Abstract

Determining the location of phonemes is important to a number of speech applications, including training automatic speech recognition systems, building text-to-speech systems, and research on human speech processing. Human labelers agree on the location of phonemes, on average, 93.78% within 20 ms on a variety of corpora, and 93.49% within 20 ms on the TIMIT corpus. We describe a baseline forced-alignment system and a proposed system that makes several modifications to this baseline: the addition of energy-based features to the standard cepstral feature set, the use of probabilities of a state transition given an observation, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities. On the test partition of the TIMIT corpus, the baseline system achieves 91.48% agreement within 20 ms and the proposed system achieves 93.36%. The proposed system thus yields a 22% relative reduction in error over the baseline system (from 8.52% to 6.64%), and a 14% reduction in error over results from a non-HMM alignment system. This result of 93.36% agreement is the best known reported result on the TIMIT corpus.
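
The agreement figures quoted in the abstract can be illustrated with a minimal sketch. It is not the author's evaluation code. The function names `agreement_within` and `relative_error_reduction` are hypothetical, and the sketch assumes that boundaries are given in milliseconds, that reference and hypothesized boundaries are paired one-to-one (as they would be in forced alignment, where the phoneme sequence is known), and that "within 20 ms" includes a difference of exactly 20 ms.

```python
def agreement_within(reference_ms, hypothesis_ms, tolerance_ms=20.0):
    """Percentage of boundaries placed within tolerance_ms of the reference.

    Assumes the two sequences are paired one-to-one; the inclusive
    comparison (<=) is an assumption, not taken from the paper.
    """
    if len(reference_ms) != len(hypothesis_ms):
        raise ValueError("boundary lists must have the same length")
    if not reference_ms:
        raise ValueError("boundary lists must not be empty")
    hits = sum(
        1
        for ref, hyp in zip(reference_ms, hypothesis_ms)
        if abs(ref - hyp) <= tolerance_ms
    )
    return 100.0 * hits / len(reference_ms)


def relative_error_reduction(baseline_pct, proposed_pct):
    """Relative reduction in error (in percent) between two agreement scores."""
    baseline_error = 100.0 - baseline_pct
    proposed_error = 100.0 - proposed_pct
    return 100.0 * (baseline_error - proposed_error) / baseline_error


if __name__ == "__main__":
    # Toy boundaries in milliseconds (invented for illustration only).
    reference = [100, 250, 400, 610]
    hypothesis = [105, 275, 395, 612]
    print(agreement_within(reference, hypothesis))

    # Figures from the abstract: 91.48% (baseline) and 93.36% (proposed).
    print(relative_error_reduction(91.48, 93.36))
```

With the abstract's figures, the second function evaluates to roughly 22%, which matches the stated relative reduction in error: the baseline error of 8.52% drops to 6.64%, and 1.88 / 8.52 is about 0.22.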