We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various degrees of audio and visual state-sequence asynchrony within each phone. Furthermore, we investigate joint training of all product-HMM parameters, instead of merely composing the model from separately trained audio-only and visual-only HMMs. We report experiments on a multi-subject connected-digit recognition task, as well as on a more complex, speaker-independent large-vocabulary dictation task. Our results demonstrate that in both cases joint multi-stream HMM training is superior to separate training of single-stream HMMs. In addition, we observe that allowing state-sequence asynchrony between the audio and visual HMM components improves connected-digit recognition significantly but degrades performance on the dictation task. The resulting multi-stream models dramatically improve the noise robustness of speech recognition by successfully exploiting the speech information carried in the visual modality: for example, at 11 dB SNR they reduce the connected-digit word error rate from 2.3% (audio-only) to 0.77% (audio-visual), and, for the large-vocabulary task, from 28.3% to 19.5%. Compared to the audio-only performance at 10 dB SNR, the use of multi-stream HMMs achieves an effective SNR gain of up to 9 dB and 7 dB, respectively, for the two recognition tasks considered.
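To make the modeling concrete, below is a minimal NumPy sketch of the kind of phone-level product HMM described above: each composite state pairs one audio state with one visual state, a bound on how far the two state indices may drift apart implements the allowed degree of state-sequence asynchrony, per-state scores combine the two stream log-likelihoods via stream exponents (lam_a * log b_a + lam_v * log b_v), and both streams are forced back into synchrony at the phone boundary. All function names, the three-state topology, and the fixed example weights are illustrative assumptions, not the paper's implementation; in particular, the paper trains all product-HMM parameters jointly rather than assembling the model by hand as done here.

```python
import numpy as np

def product_states(n_audio, n_video, max_async):
    """Composite states (i, j) of a phone-level product HMM.

    i and j index the audio and visual state chains; |i - j| <= max_async
    bounds how far the two streams may drift apart within a phone
    (max_async = 0 recovers a fully state-synchronous model).
    """
    return [(i, j) for i in range(n_audio) for j in range(n_video)
            if abs(i - j) <= max_async]

def viterbi_phone(loga, logv, lam_a=0.7, lam_v=0.3, max_async=1):
    """Viterbi pass over one phone's product HMM (illustrative sketch).

    loga, logv: (n_states, T) arrays of per-stream, per-state emission
    log-likelihoods (equal state counts assumed for simplicity).
    lam_a, lam_v: stream exponents; the values here are placeholders,
    whereas a real system would estimate or tune them.
    Each stream may hold or advance one state per frame; both streams
    must reach their final state at the phone boundary (frame T - 1).
    """
    na, T = loga.shape
    nv = logv.shape[0]
    idx = {s: k for k, s in enumerate(product_states(na, nv, max_async))}
    delta = np.full((T, len(idx)), -np.inf)
    # Both streams start synchronously in state 0 at the phone onset.
    delta[0, idx[(0, 0)]] = lam_a * loga[0, 0] + lam_v * logv[0, 0]
    for t in range(1, T):
        for (i, j), k in idx.items():
            # Stream-exponent combination of the two emission scores.
            emit = lam_a * loga[i, t] + lam_v * logv[j, t]
            # Predecessors: each stream independently stayed or advanced.
            best = -np.inf
            for prev in ((i, j), (i - 1, j), (i, j - 1), (i - 1, j - 1)):
                p = idx.get(prev)  # pruned if outside the asynchrony bound
                if p is not None:
                    best = max(best, delta[t - 1, p])
            delta[t, k] = best + emit
    # Synchronization constraint: both chains end in their final state.
    return delta[T - 1, idx[(na - 1, nv - 1)]]

# Toy usage with random emission scores for a 3-state phone model.
rng = np.random.default_rng(0)
loga = np.log(rng.random((3, 12)))
logv = np.log(rng.random((3, 12)))
print(viterbi_phone(loga, logv, max_async=1))  # phone-level Viterbi score
```

Setting max_async = 0 collapses the product model to a state-synchronous multi-stream HMM, which shows why the abstract treats the two architectures as points on a continuum; joint training, as advocated in the paper, would estimate emission models directly on the composite state space instead of taking the separately supplied loga and logv tables as given.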