A Robust Bimodal Speech Section Detection

  • Authors:
  • Kazumasa Murai; Satoshi Nakamura

  • Affiliations:
  • ATR Spoken Language Telecommunications Research Laboratories, 2-2-2 Hikaridai Seika-cho, Soraku-gun, Kyoto 619-0288, Japan (both authors)

  • Venue:
  • Journal of VLSI Signal Processing Systems
  • Year:
  • 2004

Abstract

This paper discusses robust speech section detection using audio and video modalities. Most of today's speech recognition systems require speech section detection prior to any further analysis, and the accuracy of the detected speech sections is known to affect speech recognition accuracy. Because the audio modality is intrinsically disturbed by acoustic noise, we have been researching video-based speech section detection that detects deformations in images of the speech organs. The video modality is robust to acoustic noise, but the sections it detects are longer than the audio speech sections, because the relevant organs begin to deform before the speech in preparation for articulating the first phoneme, and because the settling motion afterwards lasts longer than the speech itself. We have verified that the excess length of these inaccurately detected sections degrades the speech recognition rate by causing insertion errors. To reduce insertion errors and enhance the robustness of speech detection, we propose a method that takes advantage of both modalities. Our experiments confirm that the proposed method reduces the insertion error rate and increases the recognition rate in noisy environments.
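
The abstract describes using the noise-robust but over-long video detection as a coarse gate and the audio modality to trim the excess at each end. The paper's exact fusion rule is not reproduced here, so the following is a minimal sketch of that idea, assuming frame-synchronised boolean activity tracks from the two modalities; the function name, the `margin_frames` parameter, and its default value are illustrative choices, not the authors' implementation.

```python
import numpy as np

def combine_bimodal_vad(audio_active, video_active, margin_frames=5):
    """Fuse frame-synchronised audio and video speech-activity tracks.

    video_active is robust to acoustic noise but over-long (articulation
    starts before the speech and the settling motion outlasts it), so it
    is used only as a gate; audio_active trims each gated section to the
    acoustic extent. margin_frames is an illustrative safety margin kept
    around the audio boundaries.
    """
    audio_active = np.asarray(audio_active, dtype=bool)
    video_active = np.asarray(video_active, dtype=bool)
    fused = np.zeros_like(video_active)

    # Find [start, end) boundaries of contiguous video-detected sections.
    padded = np.r_[0, video_active.astype(np.int8), 0]
    edges = np.flatnonzero(np.diff(padded))
    for start, end in zip(edges[::2], edges[1::2]):
        audio_idx = np.flatnonzero(audio_active[start:end])
        if audio_idx.size == 0:
            continue  # lip motion with no acoustic evidence: drop the section
        s = max(start, start + audio_idx[0] - margin_frames)
        e = min(end, start + audio_idx[-1] + 1 + margin_frames)
        fused[s:e] = True
    return fused
```

Under these assumptions, a call such as `combine_bimodal_vad(audio_vad, lip_vad)` keeps only video-detected sections that contain acoustic activity and tightens their boundaries toward the audio evidence, which is the behaviour the abstract credits with reducing insertion errors while retaining the noise robustness of the video modality.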