Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition

Authors:
Bernd T. Meyer; Birger Kollmeier
Affiliations:
Medical Physics, Carl-von-Ossietzky Universität Oldenburg, D-26111 Oldenburg, Germany; Medical Physics, Carl-von-Ossietzky Universität Oldenburg, D-26111 Oldenburg, Germany
Venue:
Speech Communication
Year:
2011

Abstract

The effect of bio-inspired spectro-temporal processing on automatic speech recognition (ASR) is analyzed for two tasks, with a focus on the robustness of spectro-temporal Gabor features compared with mel-frequency cepstral coefficients (MFCCs). Experiments targeting extrinsic factors, such as additive noise and changes of the transmission channel, were carried out on a digit classification task (AURORA 2), for which spectro-temporal features were found to be more robust than the MFCC baseline against a wide range of noise sources. Intrinsic variations, i.e., changes in speaking rate, speaking effort, and pitch, were analyzed on a phoneme recognition task with matched training and test conditions. Gabor and MFCC features were found to differ systematically in their sensitivity to speaking style. An analysis of phoneme confusions for both feature types suggests that spectro-temporal and purely spectral features carry complementary information. The usefulness of the combined information was demonstrated with a system combining both feature types, which yields a decrease in word error rate of 16% compared to the best single-stream recognizer and 47% compared to an MFCC baseline.
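To illustrate what a spectro-temporal Gabor feature responds to, the following is a minimal sketch, not the authors' implementation. It assumes a filter built from a complex sinusoid multiplied by a Hann envelope, applied to a toy spectrogram rather than real speech. The function names (`gabor_filter`, `filter_response`), the filter sizes, and the modulation frequencies are all hypothetical choices made for this example.

```python
import cmath
import math


def gabor_filter(n_freq, n_time, omega_f, omega_t):
    """Return a complex 2-D Gabor filter as a list of rows (frequency x time).

    Assumption for this sketch: a complex sinusoid with spectral and temporal
    modulation frequencies omega_f and omega_t (radians per channel and per
    frame), multiplied by a separable Hann envelope.
    """
    def hann(i, n):
        # Symmetric Hann window that is strictly positive for i = 0..n-1.
        return 0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 1) / (n + 1))

    f0, t0 = (n_freq - 1) / 2.0, (n_time - 1) / 2.0
    return [
        [
            hann(f, n_freq) * hann(t, n_time)
            * cmath.exp(1j * (omega_f * (f - f0) + omega_t * (t - t0)))
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]


def filter_response(spec, filt, f_center, t_center):
    """Correlate the filter with the spectrogram patch centred at one point.

    Values outside the spectrogram are treated as zero.
    """
    n_f, n_t = len(filt), len(filt[0])
    acc = 0j
    for i in range(n_f):
        for j in range(n_t):
            f = f_center + i - n_f // 2
            t = t_center + j - n_t // 2
            if 0 <= f < len(spec) and 0 <= t < len(spec[0]):
                acc += spec[f][t] * filt[i][j]
    return acc


# Toy "spectrogram" (hypothetical data): a pure temporal modulation with a
# period of 8 frames, identical in all 16 frequency channels.
omega = 2.0 * math.pi / 8.0
spec = [[math.cos(omega * t) for t in range(64)] for _ in range(16)]

# One filter tuned to the modulation present in the data, one tuned elsewhere.
matched = gabor_filter(5, 17, 0.0, omega)
mismatched = gabor_filter(5, 17, 0.0, 3.0 * omega)

r_matched = abs(filter_response(spec, matched, 8, 32))
r_mismatched = abs(filter_response(spec, mismatched, 8, 32))
print(r_matched, r_mismatched)
```

The filter tuned to the modulation frequency present in the toy input gives a much larger response magnitude than the mismatched one. This selectivity to joint spectral and temporal modulations is the property that distinguishes such features from purely spectral representations such as MFCCs.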