Visual model structures and synchrony constraints for audio-visual speech recognition

  • Authors:
  • T. J. Hazen

  • Affiliations:
  • Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2006

Abstract

This paper presents the design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. The audio and visual feature streams are integrated using a segment-constrained hidden Markov model, which allows the visual classifier to process visual frames with a constrained amount of asynchrony relative to proposed acoustic segments. The core experiments in this paper investigate several different visual model structures, each of which provides a different means for defining the units of the visual classifier and the synchrony constraints between the audio and visual streams. Word recognition experiments are conducted on the AV-TIMIT corpus under variable additive noise conditions. Over varying acoustic signal-to-noise ratios, word error rate reductions between 14% and 60% are observed when integrating the visual information into the automatic speech recognition process.
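To make the asynchrony constraint concrete, the sketch below scores a hypothesized visual unit over boundary placements allowed to shift by a few frames relative to a proposed acoustic segment, then linearly combines the best visual score with the audio score. This is only an illustrative toy, not the paper's segment-constrained HMM: the per-frame visual scoring, the ±2-frame window, the linear stream weighting, and all function and parameter names are assumptions introduced here.

```python
import numpy as np

def visual_segment_score(visual_frame_ll, start, end):
    """Sum per-frame visual log-likelihoods over [start, end)."""
    return visual_frame_ll[start:end].sum()

def combined_score(audio_ll, visual_frame_ll, seg_start, seg_end,
                   max_async=2, visual_weight=0.5):
    """Audio log-likelihood plus the best asynchrony-constrained visual score.

    audio_ll        : audio log-likelihood of the proposed segment (scalar)
    visual_frame_ll : 1-D array of per-frame visual log-likelihoods for the
                      hypothesized visual unit (toy stand-in for a classifier)
    seg_start/end   : acoustic segment boundaries, in frame indices
    max_async       : maximum allowed boundary shift (frames) between streams
    visual_weight   : stream weight applied to the visual score
    """
    n = len(visual_frame_ll)
    best_visual = -np.inf
    # Search visual boundaries within +/- max_async frames of the acoustic ones.
    for ds in range(-max_async, max_async + 1):
        for de in range(-max_async, max_async + 1):
            s = min(max(seg_start + ds, 0), n - 1)
            e = min(max(seg_end + de, s + 1), n)
            best_visual = max(best_visual,
                              visual_segment_score(visual_frame_ll, s, e))
    return audio_ll + visual_weight * best_visual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_ll = rng.normal(-2.0, 0.5, size=100)   # toy per-frame visual scores
    print(combined_score(audio_ll=-35.0, visual_frame_ll=frame_ll,
                         seg_start=40, seg_end=52))
```

In this toy formulation the asynchrony tolerance is explicit: setting max_async to 0 forces the two streams to share segment boundaries, while larger values let the visual evidence drift relative to the acoustic segmentation, loosely mirroring the trade-off the paper studies across its different visual model structures.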