Fusing data streams in continuous audio-visual speech recognition

  • Authors:
  • Leon J. M. Rothkrantz;Jacek C. Wojdeł;Pascal Wiggers

  • Affiliations:
  • Man–Machine Interaction Group, Delft University of Technology, Delft, The Netherlands;Man–Machine Interaction Group, Delft University of Technology, Delft, The Netherlands;Man–Machine Interaction Group, Delft University of Technology, Delft, The Netherlands

  • Venue:
  • TSD'05: Proceedings of the 8th International Conference on Text, Speech and Dialogue
  • Year:
  • 2005

Abstract

Speech recognition still lacks robustness when faced with changing noise characteristics. Automatic lip reading, on the other hand, is not affected by acoustic noise and can therefore provide the speech recognizer with valuable additional information, especially since the visual modality contains information that is complementary to the information in the audio channel. In this paper we present a novel way of processing the video signal for lip reading and a post-processing data transformation that can be used alongside it. The presented method, Lip Geometry Estimation (LGE), is compared with other geometry-based and image-intensity-based techniques typically deployed for this task. A large-vocabulary continuous audio-visual speech recognizer for Dutch using this method has been implemented. We show that the combined system improves upon audio-only recognition in the presence of noise.