Consistent modeling of the static and time-derivative cepstrums for speech recognition using HSPTM

Authors:
Yiu-Pong Lai;Man-Hung Siu
Affiliations:
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong;Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong
Venue:
ISCSLP'06 Proceedings of the 5th international conference on Chinese Spoken Language Processing
Year:
2006

Abstract

Most speech models represent the static and derivative cepstral features with separate models that can be inconsistent with each other. In our previous work, we proposed the hidden spectral peak trajectory model (HSPTM), in which the static cepstral trajectories are derived from a set of hidden trajectories of the spectral peaks (captured as spectral poles) in the time-frequency domain. In this work, the HSPTM is generalized so that both the static and derivative features are derived from a single set of hidden pole trajectories, using the well-known relationship between the spectral poles and the cepstral coefficients. Because the pole trajectories represent the resonance frequencies across time, they can be interpreted as formant tracks in voiced speech, which have been shown to contain important cues for phonemic identification. To preserve the common recognition framework, the likelihood functions are still defined in the cepstral domain, with the acoustic models defined by the static and derivative cepstral trajectories. However, these trajectories are no longer estimated separately but derived jointly, and are therefore guaranteed to be consistent with each other. Vowel classification experiments were performed on the TIMIT corpus using low-complexity (2-mixture) models. The HSPTM reduced the classification error by 3% (absolute) compared with a standard HMM of the same complexity.
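
The relationship between spectral poles and cepstral coefficients mentioned in the abstract can be illustrated with a minimal sketch. It assumes an all-pole spectrum made of complex-conjugate pole pairs, for which the n-th cepstral coefficient is c_n = (2/n) Σ_k r_k^n cos(n θ_k), and it obtains the derivative cepstrum by applying the chain rule to the same pole trajectories. The function names, the bandwidth-to-radius mapping r = exp(-π b / fs), and the parameter layout are assumptions made for illustration; the abstract does not state the paper's exact parameterization.

```python
import math


def pole_params(freq_hz, bw_hz, fs):
    """Map a resonance (centre frequency, bandwidth) to pole radius and angle.

    Assumption: the common convention r = exp(-pi * b / fs), theta = 2 * pi * f / fs.
    """
    r = math.exp(-math.pi * bw_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    return r, theta


def static_cepstrum(freqs_hz, bws_hz, fs, n_ceps):
    """Cepstral coefficients c_1..c_N of an all-pole spectrum.

    Each conjugate pole pair contributes (2 / n) * r**n * cos(n * theta) to c_n.
    """
    ceps = []
    for n in range(1, n_ceps + 1):
        c_n = 0.0
        for f, b in zip(freqs_hz, bws_hz):
            r, theta = pole_params(f, b, fs)
            c_n += (2.0 / n) * r ** n * math.cos(n * theta)
        ceps.append(c_n)
    return ceps


def delta_cepstrum(freqs_hz, bws_hz, dfreqs_hz, dbws_hz, fs, n_ceps):
    """Time derivative of c_1..c_N, derived from the same pole trajectories.

    dfreqs_hz and dbws_hz are the time derivatives (Hz per second) of the
    centre frequencies and bandwidths at the current instant.
    """
    deltas = []
    for n in range(1, n_ceps + 1):
        d_n = 0.0
        for f, b, df, db in zip(freqs_hz, bws_hz, dfreqs_hz, dbws_hz):
            r, theta = pole_params(f, b, fs)
            dr = -(math.pi / fs) * db * r          # d(radius)/dt
            dtheta = (2.0 * math.pi / fs) * df     # d(angle)/dt
            # Chain rule applied to (2 / n) * r**n * cos(n * theta).
            d_n += (2.0 * r ** (n - 1) * dr * math.cos(n * theta)
                    - 2.0 * r ** n * math.sin(n * theta) * dtheta)
        deltas.append(d_n)
    return deltas
```

Because both functions read the same pole parameters, the derivative features cannot drift away from the static ones, which is the sense of "consistent" used in the abstract. Calling `static_cepstrum([500.0, 1500.0], [80.0, 120.0], 16000.0, 12)` and `delta_cepstrum` with matching slopes gives one frame of static and derivative features for two hypothetical formant-like resonances.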