Fusing data streams in continuous audio-visual speech recognition

Authors:
Leon J. M. Rothkrantz;Jacek C. Wojdeł;Pascal Wiggers
Affiliations:
Man–Machine Interaction Group, Delft University of Technology, Delft, The Netherlands;Man–Machine Interaction Group, Delft University of Technology, Delft, The Netherlands;Man–Machine Interaction Group, Delft University of Technology, Delft, The Netherlands
Venue:
TSD'05 Proceedings of the 8th international conference on Text, Speech and Dialogue
Year:
2005

Abstract

Speech recognition still lacks robustness when faced with changing noise characteristics. Automatic lip reading, on the other hand, is not affected by acoustic noise and can therefore provide the speech recognizer with valuable additional information, especially since the visual modality carries information complementary to that in the audio channel. In this paper we present a novel way of processing the video signal for lip reading, together with a post-processing data transformation that can be used alongside it. The presented method, Lip Geometry Estimation (LGE), is compared with other geometry-based and image-intensity-based techniques typically deployed for this task. A large-vocabulary continuous audio-visual speech recognizer for Dutch using this method has been implemented. We show that the combined audio-visual system improves upon audio-only recognition in the presence of noise.