Integrating audio and visual information to provide highly robust speech recognition

  • Authors:
  • M. J. Tomlinson; M. J. Russell; N. M. Brooke

  • Affiliations:
  • Speech Res. Unit, DRA, Malvern, UK

  • Venue:
  • ICASSP '96: Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing - Volume 02
  • Year:
  • 1996

Abstract

Many human-machine interactions require accurate automatic speech recognition in the presence of high levels of interfering noise. The paper shows that improvements in recognition accuracy can be obtained by including data derived from a speaker's lip images. We describe the combination of the audio and visual data in the construction of composite feature vectors, together with a hidden Markov model structure that allows for asynchrony between the audio and visual components. These ideas are applied to a speaker-dependent recognition task involving a small vocabulary and subject to interfering noise. The recognition results obtained using composite vectors and cross-product models are compared with those based on an audio-only feature vector. The approach is shown to increase performance over a very wide range of noise levels.
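The abstract's core idea of composite feature vectors can be illustrated with a minimal sketch. The code below is not the authors' implementation; the feature dimensions, frame counts, and the linear-interpolation resampling of the slower visual stream to the audio frame rate are all assumptions made for this example.

```python
import numpy as np

def resample_visual(visual, n_audio_frames):
    """Linearly interpolate visual features to the audio frame rate.

    Assumption for this sketch: lip-image features arrive at a lower
    frame rate than the audio features and are upsampled per dimension.
    """
    n_visual, dim = visual.shape
    src = np.linspace(0.0, 1.0, n_visual)
    dst = np.linspace(0.0, 1.0, n_audio_frames)
    return np.stack(
        [np.interp(dst, src, visual[:, d]) for d in range(dim)], axis=1
    )

def composite_vectors(audio, visual):
    """Concatenate audio and resampled visual features frame by frame."""
    visual_rs = resample_visual(visual, audio.shape[0])
    return np.concatenate([audio, visual_rs], axis=1)

# Illustrative sizes: 100 audio frames of 12 cepstral coefficients,
# 25 visual frames of 6 lip-shape parameters.
audio = np.random.randn(100, 12)
visual = np.random.randn(25, 6)
composite = composite_vectors(audio, visual)
print(composite.shape)  # (100, 18)
```

Each composite vector simply stacks the two modalities; the paper's cross-product HMM structure then relaxes the implicit assumption that audio and visual states move in lockstep, which a plain concatenation cannot capture on its own.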