Impact of vocal effort variability on automatic speech recognition

Authors:
Petr Zelinka;Milan Sigmund;Jiri Schimmel
Affiliations:
Department of Radio Electronics, Brno University of Technology, Purkynova 118, 612 00 Brno, Czech Republic;Department of Radio Electronics, Brno University of Technology, Purkynova 118, 612 00 Brno, Czech Republic;Department of Telecommunications, Brno University of Technology, Purkynova 118, 612 00 Brno, Czech Republic
Venue:
Speech Communication
Year:
2012


Abstract

The impact of changes in a speaker's vocal effort on the performance of automatic speech recognition has largely been overlooked by researchers and virtually no speech resources exist for the development and testing of speech recognizers at all vocal effort levels. This study deals with speech properties in the whole range of vocal modes - whispering, soft speech, normal speech, loud speech, and shouting. Fundamental acoustic and phonetic changes are documented. The impact of vocal effort variability on the performance of an isolated-word recognizer is shown and effective means of improving the system's robustness are tested. The proposed multiple model framework approach reaches a 50% relative reduction of word error rate compared to the baseline system. A new specialized speech database, BUT-VE1, is presented, which contains speech recordings of 13 speakers at 5 vocal effort levels with manual phonetic segmentation and sound pressure level calibration.
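
The abstract reports its result as a relative reduction of word error rate (WER). As a minimal sketch of how such a figure is usually computed, the snippet below takes the difference between baseline and improved WER and divides it by the baseline WER. The function name and the two WER values are hypothetical illustrations, not numbers from the paper; they are chosen only so that the outcome matches a 50% relative reduction.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical WER values in percent (assumed for illustration, not from the paper).
baseline_wer = 20.0
improved_wer = 10.0

reduction = relative_wer_reduction(baseline_wer, improved_wer)
print(f"Relative WER reduction: {reduction:.0%}")
```

Note that a relative reduction says nothing about the absolute error level: halving a 20% WER and halving a 2% WER both count as a 50% relative reduction.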