Why is the recognition of spontaneous speech so hard?

Authors:
Sadaoki Furui;Masanobu Nakamura;Tomohisa Ichiba;Koji Iwano
Affiliations:
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan;Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan;Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan;Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
Venue:
TSD'05 Proceedings of the 8th international conference on Text, Speech and Dialogue
Year:
2005

Abstract

Although read speech and similar types of speech, e.g. speech from reading newspapers or from news broadcasts, can be recognized with high accuracy, recognition accuracy decreases drastically for spontaneous speech. This is because spontaneous speech and read speech differ significantly, both acoustically and linguistically. This paper reports on the analysis and recognition of spontaneous speech using a large-scale spontaneous speech database, the “Corpus of Spontaneous Japanese (CSJ)”. Recognition results in this experiment show that recognition accuracy increases significantly as a function of the size of the acoustic and language model training data, and that the improvement levels off at approximately 7M words of training data. This means that the acoustic and linguistic variation of spontaneous speech is so large that a very large corpus is needed to encompass it. Spectral analysis of various utterance styles in the CSJ shows that the spectral distribution of phonemes, and hence the spectral difference between them, is significantly reduced in spontaneous speech compared to read speech. Experimental results also show a strong correlation between the mean spectral distance between phonemes and phoneme recognition accuracy. This indicates that spectral reduction is one major reason for the decrease in recognition accuracy for spontaneous speech.
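The relationship described in the abstract can be illustrated with a minimal sketch. It assumes each phoneme is summarized by one mean spectral vector, and that "spectral distance" is the plain Euclidean distance between those vectors. The abstract does not state the distance measure or feature type, so both are assumptions. The function names and all numeric values are hypothetical toy data, not results from the paper.

```python
import math
from itertools import combinations


def mean_spectral_distance(phoneme_means):
    """Average Euclidean distance over all pairs of phoneme mean vectors.

    phoneme_means maps a phoneme label to its mean spectral vector.
    Euclidean distance is an assumption made for this sketch.
    """
    pairs = list(combinations(phoneme_means.values(), 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy 2-dimensional "spectral" means for three phonemes (made-up values).
read_speech = {"a": (0.0, 0.0), "i": (4.0, 0.0), "u": (0.0, 3.0)}

# Spectral reduction modeled as shrinking every vector toward the origin.
spontaneous = {p: (0.5 * x, 0.5 * y) for p, (x, y) in read_speech.items()}

read_distance = mean_spectral_distance(read_speech)
spontaneous_distance = mean_spectral_distance(spontaneous)

# Hypothetical per-style (distance, accuracy) pairs, used only to show the call.
style_distances = [2.0, 3.0, 4.0]
style_accuracies = [50.0, 60.0, 70.0]
correlation = pearson(style_distances, style_accuracies)
```

In this toy setup, halving every mean vector halves the mean inter-phoneme distance, which is the kind of spectral reduction the abstract associates with lower phoneme recognition accuracy.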