Intelligibility rating with automatic speech recognition, prosodic, and cepstral evaluation

Authors:
Tino Haderlein; Cornelia Moers; Bernd Möbius; Frank Rosanowski; Elmar Nöth
Affiliations:
University of Erlangen-Nuremberg, Pattern Recognition Lab (Informatik 5) and Department of Phoniatrics and Pedaudiology, Erlangen, Germany; University of Bonn, Department of Speech and Communication, Bonn, Germany; Saarland University, Department of Computational Linguistics and Phonetics, Saarbrücken, Germany; University of Erlangen-Nuremberg, Department of Phoniatrics and Pedaudiology, Erlangen, Germany; University of Erlangen-Nuremberg, Pattern Recognition Lab (Informatik 5), Erlangen, Germany
Venue:
TSD'11: Proceedings of the 14th International Conference on Text, Speech and Dialogue
Year:
2011

Abstract

Speech intelligibility is an important criterion in voice rehabilitation. Automatic speech recognition combined with prosodic analysis has been shown to evaluate intelligibility successfully. In this paper, this method is extended with measures based on the Cepstral Peak Prominence (CPP). Seventy-three hoarse patients (aged 48.3±16.8 years) uttered the vowel /e/ and read the German version of the text "The North Wind and the Sun". Five speech therapists and physicians rated their intelligibility perceptually on a 5-point scale. Support Vector Regression (SVR) yielded a feature set with a human-machine correlation of r = 0.85, consisting of the word accuracy, the smoothed CPP computed from a speech section, and three prosodic features (normalized energy of word-pause-word intervals, F0 value at voice offset in a word, and standard deviation of jitter). The average human-human correlation was r = 0.82. Hence, the automatic method can provide meaningful objective support for perceptual analysis.
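
The two agreement figures in the abstract (human-machine and human-human correlation) can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not state: that r is the Pearson coefficient, that the human reference is the per-speaker mean of all raters, and that human-human agreement is each rater correlated with the mean of the remaining raters. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def human_machine_r(ratings, machine):
    """Correlate the per-speaker mean of all raters with the machine score.

    ratings: one list of scores per rater, all in the same speaker order.
    (Assumption: the mean rating serves as the human reference.)
    """
    reference = [mean(col) for col in zip(*ratings)]
    return pearson(reference, machine)


def human_human_r(ratings):
    """Average correlation of each rater with the mean of the other raters.

    (Assumption: this leave-one-rater-out scheme; the abstract does not
    say how the human-human figure was obtained.)
    """
    rs = []
    for i, own in enumerate(ratings):
        others = [r for j, r in enumerate(ratings) if j != i]
        rs.append(pearson(own, [mean(col) for col in zip(*others)]))
    return mean(rs)


# Toy data invented for illustration, NOT from the paper:
# 3 hypothetical raters scoring 5 speakers on a 5-point scale.
toy_ratings = [
    [1, 2, 3, 4, 5],
    [1, 3, 3, 4, 4],
    [2, 2, 4, 5, 5],
]
toy_machine = [1.4, 2.1, 3.2, 4.4, 4.6]  # hypothetical regression output

print(round(human_machine_r(toy_ratings, toy_machine), 2))
print(round(human_human_r(toy_ratings), 2))
```

In a setup like the one described, the machine scores would come from the regression model's predictions on held-out speakers; the toy list above merely stands in for them.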