Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments

Authors:
Hynek Bořil;John H. L. Hansen
Affiliations:
Center for Robust Speech Systems, The University of Texas at Dallas, Richardson, TX;Center for Robust Speech Systems, The University of Texas at Dallas, Richardson, TX
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Citing 10
Cited 2

Classification of speech under stress using target driven features

Speech Communication - Special issue on speech under stress
Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition

Speech Communication - Special issue on speech under stress
Speaker Normalization Based on Frequency Warping

ICASSP '97 Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97) - Volume 2
Analysis and compensation of stressed and noisy speech with application to robust automatic recognition

Analysis and compensation of stressed and noisy speech with application to robust automatic recognition
A parametric approach to vocal tract length normalization

ICASSP '96 Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
Speaker normalization using efficient frequency warping procedures

ICASSP '96 Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 01
A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition

Speech Communication
Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environment

ICASSP '09 Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing
Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition

IEEE Transactions on Audio, Speech, and Language Processing
Quantile based histogram equalization for noise robust large vocabulary speech recognition

IEEE Transactions on Audio, Speech, and Language Processing

Spectral histogram of oriented gradients (SHOGs) for Tamil language male/female speaker classification

International Journal of Speech Technology
Maximum Likelihood Acoustic Factor Analysis Models for Robust Speaker Verification in Noise

IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)

Abstract

In the presence of environmental noise, speakers tend to adjust their speech production in an effort to preserve intelligible communication. The noise-induced speech adjustments, called Lombard effect (LE), are known to severely impact the accuracy of automatic speech recognition (ASR) systems. The reduced performance results from the mismatch between the ASR acoustic models trained typically on noise-clean neutral (modal) speech and the actual parameters of noisy LE speech. In this study, novel unsupervised frequency domain and cepstral domain equalizations that increase ASR resistance to LE are proposed and incorporated in a recognition scheme employing a codebook of noisy acoustic models. In the frequency domain, short-time speech spectra are transformed towards neutral ASR acoustic models in a maximum-likelihood fashion. Simultaneously, dynamics of cepstral samples are determined from the quantile estimates and normalized to a constant range. A codebook decoding strategy is applied to determine the noisy models best matching the actual mixture of speech and noisy background. The proposed algorithms are evaluated side by side with conventional compensation schemes on connected Czech digits presented in various levels of background car noise. The resulting system provides an absolute word error rate (WER) reduction on 10-dB signal-to-noise ratio data of 8.7% and 37.7% for female neutral and LE speech, respectively, and of 8.7% and 32.8% for male neutral and LE speech, respectively, when compared to the baseline recognizer employing perceptual linear prediction (PLP) coefficients and cepstral mean and variance normalization.
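
The cepstral-domain step described in the abstract (estimating the dynamics of cepstral samples from quantiles and normalizing them to a constant range) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `quantile_range_normalize`, the 5% and 95% quantile levels, the target range of [-1, 1], and the per-utterance estimation are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def quantile_range_normalize(cepstra, q_low=0.05, q_high=0.95, eps=1e-8):
    """Map each cepstral dimension so its [q_low, q_high] quantile
    interval lands on [-1, 1].

    cepstra: array of shape (num_frames, num_coefficients).
    q_low, q_high: quantile levels (assumed values, not from the paper).
    """
    cepstra = np.asarray(cepstra, dtype=float)

    # Per-dimension quantile estimates taken over all frames of the utterance.
    lo = np.quantile(cepstra, q_low, axis=0)
    hi = np.quantile(cepstra, q_high, axis=0)

    # Center of the quantile interval and half of its width.
    center = 0.5 * (lo + hi)
    half_range = np.maximum(0.5 * (hi - lo), eps)  # guard against division by zero

    # Affine map: lo -> -1, hi -> +1, applied independently per dimension.
    return (cepstra - center) / half_range


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "cepstra": three dimensions with very different offsets and scales.
    frames = rng.normal(loc=[0.0, 5.0, -3.0], scale=[1.0, 10.0, 0.1], size=(500, 3))
    normalized = quantile_range_normalize(frames)
    print(np.quantile(normalized, [0.05, 0.95], axis=0))
```

Compared with cepstral mean and variance normalization, which the baseline in the abstract uses, a quantile-based range depends on order statistics rather than on the first two moments, so a few outlying frames have less influence on the scaling.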