We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various degrees of audio and visual state-sequence asynchrony within each phone. Furthermore, we investigate joint training of all product-HMM parameters, instead of merely composing the model from separately trained audio-only and visual-only HMMs. We report experiments on a multi-subject connected-digit recognition task, as well as on a more complex, speaker-independent large-vocabulary dictation task. Our results demonstrate that in both cases joint multi-stream HMM training is superior to separate training of single-stream HMMs. In addition, we observe that allowing state-sequence asynchrony between the audio and visual HMM components improves connected-digit recognition significantly but degrades performance on the dictation task. The resulting multi-stream models dramatically improve the noise robustness of speech recognition by successfully exploiting the speech information carried in the visual modality: for example, at 11 dB SNR they reduce the connected-digit word error rate from 2.3% (audio-only) to 0.77% (audio-visual), and, for the large-vocabulary task, from 28.3% to 19.5%. Compared to the audio-only performance at 10 dB SNR, the use of multi-stream HMMs achieves an effective SNR gain of up to 9 dB and 7 dB, respectively, for the two recognition tasks considered.
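To make the modeling concrete, below is a minimal NumPy sketch of the kind of phone-level product HMM described above: each composite state pairs one audio state with one visual state, a bound on how far the two state indices may drift apart implements the allowed degree of state-sequence asynchrony, per-state scores combine the two stream log-likelihoods via stream exponents (lam_a * log b_a + lam_v * log b_v), and both streams are forced back into synchrony at the phone boundary. All function names, the three-state topology, and the fixed example weights are illustrative assumptions, not the paper's implementation; in particular, the paper trains all product-HMM parameters jointly rather than assembling the model by hand as done here.

```python
import numpy as np

def product_states(n_audio, n_video, max_async):
    """Composite states (i, j) of a phone-level product HMM.

    i and j index the audio and visual state chains; |i - j| <= max_async
    bounds how far the two streams may drift apart within a phone
    (max_async = 0 recovers a fully state-synchronous model).
    """
    return [(i, j) for i in range(n_audio) for j in range(n_video)
            if abs(i - j) <= max_async]

def viterbi_phone(loga, logv, lam_a=0.7, lam_v=0.3, max_async=1):
    """Viterbi pass over one phone's product HMM (illustrative sketch).

    loga, logv: (n_states, T) arrays of per-stream, per-state emission
    log-likelihoods (equal state counts assumed for simplicity).
    lam_a, lam_v: stream exponents; the values here are placeholders,
    whereas a real system would estimate or tune them.
    Each stream may hold or advance one state per frame; both streams
    must reach their final state at the phone boundary (frame T - 1).
    """
    na, T = loga.shape
    nv = logv.shape[0]
    idx = {s: k for k, s in enumerate(product_states(na, nv, max_async))}
    delta = np.full((T, len(idx)), -np.inf)
    # Both streams start synchronously in state 0 at the phone onset.
    delta[0, idx[(0, 0)]] = lam_a * loga[0, 0] + lam_v * logv[0, 0]
    for t in range(1, T):
        for (i, j), k in idx.items():
            # Stream-exponent combination of the two emission scores.
            emit = lam_a * loga[i, t] + lam_v * logv[j, t]
            # Predecessors: each stream independently stayed or advanced.
            best = -np.inf
            for prev in ((i, j), (i - 1, j), (i, j - 1), (i - 1, j - 1)):
                p = idx.get(prev)  # pruned if outside the asynchrony bound
                if p is not None:
                    best = max(best, delta[t - 1, p])
            delta[t, k] = best + emit
    # Synchronization constraint: both chains end in their final state.
    return delta[T - 1, idx[(na - 1, nv - 1)]]

# Toy usage with random emission scores for a 3-state phone model.
rng = np.random.default_rng(0)
loga = np.log(rng.random((3, 12)))
logv = np.log(rng.random((3, 12)))
print(viterbi_phone(loga, logv, max_async=1))  # phone-level Viterbi score
```

Setting max_async = 0 collapses the product model to a state-synchronous multi-stream HMM, which shows why the abstract treats the two architectures as points on a continuum; joint training, as advocated in the paper, would estimate emission models directly on the composite state space instead of taking the separately supplied loga and logv tables as given.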