We present work on improving the performance of automatic speech recognizers by using additional visual information (lip-/speechreading), achieving error reductions of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve recognition performance. We demonstrate this on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. We have developed adaptive combination methods at several levels of the recognition network; in some cases, additional information such as the estimated signal-to-noise ratio (SNR) is used. Results for the different combination methods are shown for clean speech and for data with artificial noise (white noise, music, motor noise). The new combination methods adapt automatically to varying noise conditions, making hand-tuned parameters unnecessary.
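The SNR-adaptive combination described above can be illustrated with a minimal sketch: the audio and visual streams each produce per-class scores, and a weight derived from the estimated SNR shifts trust toward the visual stream as the acoustic channel degrades. The linear SNR-to-weight mapping and the function names below are illustrative assumptions, not the paper's actual network-level combination rule.

```python
import numpy as np

def snr_to_audio_weight(snr_db, lo_db=0.0, hi_db=30.0):
    """Map an estimated SNR (in dB) to an audio-stream weight in [0, 1].

    Linear ramp (an assumed mapping, for illustration only): at lo_db or
    below, rely entirely on the visual stream; at hi_db or above, rely
    entirely on the acoustic stream.
    """
    return float(np.clip((snr_db - lo_db) / (hi_db - lo_db), 0.0, 1.0))

def fuse_log_scores(audio_log_scores, visual_log_scores, snr_db):
    """Weighted log-linear fusion of per-class stream scores.

    The weight adapts automatically to the current noise estimate, so no
    hand-tuned fusion parameter is needed.
    """
    lam = snr_to_audio_weight(snr_db)
    return lam * np.asarray(audio_log_scores) \
        + (1.0 - lam) * np.asarray(visual_log_scores)

# Example: at 15 dB (midpoint), both streams contribute equally.
fused = fuse_log_scores([0.0, -1.0], [-1.0, 0.0], snr_db=15.0)
```

With this scheme, clean speech (high SNR) drives the weight toward the acoustic scores, while heavy noise pushes it toward the lipreading scores, mirroring the adaptive behavior the paper reports.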