Effect of acoustic and linguistic contexts on human and machine speech recognition

Authors:
Norihide Kitaoka;Daisuke Enami;Seiichi Nakagawa
Venue:
Computer Speech and Language
Year:
2014

Abstract

We compared the performance of an automatic speech recognition system that used n-gram language models, HMM acoustic models, or combinations of the two with the word recognition performance of human subjects who had access to acoustic information only, to local linguistic context only, or to both. All speech recordings were taken from Japanese narration and spontaneous speech corpora. Humans have difficulty recognizing isolated words taken out of context, especially words from spontaneous speech, partly because of word-boundary coarticulation. Human recognition performance improves dramatically when one or two preceding words are added. Short words in Japanese consist mainly of post-positional particles (e.g., wa, ga, wo, ni), which are function words located just after content words such as nouns and verbs. The predictability of short words is therefore very high given the one or two preceding words, and recognition of short words improves drastically. Providing even more context further improves human prediction performance under text-only conditions (without acoustic signals). It also improves speech recognition, but the improvement is relatively small. Recognition experiments using an automatic speech recognizer were conducted under conditions almost identical to those of the experiments with humans. The performance of the acoustic models without any language model, or with only a unigram language model, was greatly inferior to human recognition performance with no context. In contrast, prediction performance using a trigram language model was superior or comparable to human performance when given a preceding and a succeeding word. These results suggest that, to make automatic speech recognizers comparable to humans in recognition performance when linguistic context is limited, we must improve our acoustic models rather than our language models.
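
The text-only prediction condition described in the abstract (guessing a word from one preceding and one succeeding word) can be illustrated with a minimal sketch. This is not the authors' system: the toy romanized corpus, the function names `train_trigrams` and `predict_middle`, and the add-one smoothing are all assumptions made for illustration. The sketch ranks candidate middle words by smoothed trigram counts.

```python
from collections import Counter

# Hypothetical toy corpus in romanized Japanese (illustration only,
# not data from the paper).
corpus = [
    "watashi wa gakusei desu",
    "watashi wa sensei desu",
    "kare wa gakusei desu",
    "kare ga hon wo yomu",
    "watashi ga hon wo kau",
]

def train_trigrams(sentences):
    """Count every (previous, middle, next) word triple in the corpus."""
    counts = Counter()
    vocab = set()
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        for a, b, c in zip(words, words[1:], words[2:]):
            counts[(a, b, c)] += 1
    return counts, vocab

def predict_middle(prev, nxt, counts, vocab):
    """Rank candidate middle words given one preceding and one succeeding word.

    Scores are add-one-smoothed relative frequencies of (prev, w, nxt);
    the smoothing choice is an assumption, not taken from the paper.
    """
    total = sum(counts[(prev, w, nxt)] for w in vocab) + len(vocab)
    scores = {w: (counts[(prev, w, nxt)] + 1) / total for w in vocab}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

counts, vocab = train_trigrams(corpus)
ranking = predict_middle("hon", "yomu", counts, vocab)
best_word, best_prob = ranking[0]
print(best_word, round(best_prob, 3))
```

In this toy setting the particle between a noun and a verb is the top-ranked candidate, which mirrors the abstract's point that short function words are highly predictable from their immediate neighbors.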