Towards an intelligent acoustic front end for automatic speech recognition: built-in speaker normalization

Authors:
Umit H. Yapanel;John H. L. Hansen
Affiliations:
Center for Robust Speech Systems, Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX;Center for Robust Speech Systems, Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX
Venue:
EURASIP Journal on Audio, Speech, and Music Processing - Intelligent Audio, Speech, and Music Processing Applications
Year:
2008

Abstract

A proven way to keep automatic speech recognition (ASR) effective in the presence of speaker differences is to normalize the acoustic features across speakers. More effective speaker normalization methods are needed that require only limited computing resources, so that they can run in real time. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), in which normalization is performed on the fly within a newly proposed PMVDR acoustic front end. The novelty of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, whereas the proposed BISN method uses a single speaker-dependent warp to achieve the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation yields a 24% relative reduction in word error rate (WER), and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement is 9%, both measured against the baseline speaker normalization method.
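
The idea of folding two warping phases into one can be illustrated with a small sketch. The abstract does not give the functional form of either warp, so the code below assumes, purely for illustration, that both the perceptual warp and the VTLN warp are first-order all-pass (bilinear) frequency warps. Under that assumption, applying a warp with parameter `a` followed by a warp with parameter `b` is equivalent to one warp with parameter `(a + b) / (1 + a * b)`. The function names and the parameter values are hypothetical and are not taken from the paper.

```python
import math


def allpass_warp(omega, alpha):
    """Warp a normalized frequency omega (radians, 0..pi) with a
    first-order all-pass (bilinear) warp of parameter alpha, |alpha| < 1.

    ASSUMPTION: this warp family is chosen for illustration only; the
    abstract does not state which warping function BISN uses.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def combine_alphas(a, b):
    """Parameter of the single all-pass warp that equals warp(a) followed
    by warp(b). This follows from composing the two bilinear maps."""
    return (a + b) / (1.0 + a * b)


def max_composition_error(a, b, n_points=101):
    """Largest difference, over a grid on [0, pi], between the two-stage
    warp and the equivalent single-stage warp."""
    c = combine_alphas(a, b)
    worst = 0.0
    for k in range(n_points):
        omega = math.pi * k / (n_points - 1)
        two_stage = allpass_warp(allpass_warp(omega, a), b)
        one_stage = allpass_warp(omega, c)
        worst = max(worst, abs(two_stage - one_stage))
    return worst


# Hypothetical values: a perceptual warp parameter and a small
# speaker-dependent correction standing in for the VTLN warp.
perceptual_alpha = 0.42
speaker_alpha = 0.05
print(combine_alphas(perceptual_alpha, speaker_alpha))
print(max_composition_error(perceptual_alpha, speaker_alpha))
```

The second printed value should be at the level of floating-point round-off, which shows that, for this warp family, one pass with a single speaker-dependent parameter reproduces the result of the two separate passes. How BISN actually estimates its speaker-dependent warp is not described in the abstract and is not modeled here.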