This paper presents the design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that uses a segment-based modeling strategy. The audio and visual feature streams are integrated with a segment-constrained hidden Markov model, which allows the visual classifier to process visual frames with a constrained amount of asynchrony relative to proposed acoustic segments. The core experiments in this paper investigate several different visual model structures, each of which provides a different means of defining the units of the visual classifier and the synchrony constraints between the audio and visual streams. Word recognition experiments are conducted on the AV-TIMIT corpus under variable additive noise conditions. Across the acoustic signal-to-noise ratios tested, word error rate reductions between 14% and 60% are observed when the visual information is integrated into the automatic speech recognition process.
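To make the fusion idea concrete, the sketch below shows one simple way to combine an acoustic segment score with visual frame scores pooled over a window that extends slightly beyond the proposed segment boundaries, mimicking a constrained amount of audio-visual asynchrony. It is an illustrative assumption only: the function and parameter names, the averaging pooling, the frame rate, the asynchrony tolerance, and the linear score-fusion rule are all stand-ins and do not reflect the paper's actual segment-constrained HMM.

```python
# Minimal sketch (assumed, not the paper's model): fuse an acoustic
# segment log-likelihood with visual frame log-likelihoods that fall
# within a bounded asynchrony window around the proposed segment.
import numpy as np

def fuse_segment_scores(audio_score, visual_frame_scores, seg_start, seg_end,
                        frame_rate=30.0, max_async=0.1, visual_weight=0.3):
    """Combine an acoustic segment score with pooled visual frame scores.

    Visual frames within `max_async` seconds of the proposed acoustic
    segment boundaries are averaged, modeling a constrained amount of
    audio-visual asynchrony; the two streams are then mixed log-linearly.
    """
    # Widen the segment by the allowed asynchrony and map to frame indices.
    lo = max(0, int((seg_start - max_async) * frame_rate))
    hi = min(len(visual_frame_scores),
             int(np.ceil((seg_end + max_async) * frame_rate)))
    if hi <= lo:  # no visual frames fall inside the window
        return audio_score
    visual_score = float(np.mean(visual_frame_scores[lo:hi]))
    # Weighted combination of the audio and visual stream scores.
    return (1.0 - visual_weight) * audio_score + visual_weight * visual_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 3 seconds of synthetic 30 fps visual log-likelihoods.
    visual_scores = rng.normal(-2.0, 0.5, size=90)
    fused = fuse_segment_scores(audio_score=-1.5,
                                visual_frame_scores=visual_scores,
                                seg_start=0.8, seg_end=1.1)
    print(f"fused segment score: {fused:.3f}")
```

In a full recognizer, a score of this kind would be computed for every hypothesized acoustic segment and fed back into the search, with the asynchrony window and stream weight tuned on held-out data; the paper's visual model structures differ precisely in how such units and synchrony constraints are defined.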