Robust Sensor Fusion: Analysis and Application to Audio Visual Speech Recognition

Authors:
Javier R. Movellan; Paul Mineiro
Affiliations:
Department of Cognitive Science, University of California San Diego, La Jolla, CA 92093-0515. E-mail: {movellan,pmineiro}@cogsci.ucsd.edu
Venue:
Machine Learning - Special issue on context sensitivity and concept drift
Year:
1998

Abstract

This paper analyzes the issue of catastrophic fusion, a problem that occurs in multimodal recognition systems that integrate the output from several modules while working in non-stationary environments. For concreteness we frame the analysis with regard to the problem of automatic audio visual speech recognition (AVSR), but the issues at hand are very general and arise in multimodal recognition systems which need to work in a wide variety of contexts. Catastrophic fusion is said to have occurred when the performance of a multimodal system is inferior to the performance of some isolated modules, e.g., when the performance of the audio visual speech recognition system is inferior to that of the audio system alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. Practice shows that when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based upon Bayesian ideas of competitive models and inference robustification. We study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic problem on audio visual speech recognition (AVSR) with excellent results.
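
The phenomenon defined in the abstract can be reproduced on a toy Gaussian discrimination task. The sketch below is an illustration only, not the authors' method or experimental setup: the function name `simulate`, the class means of -1 and +1, the unit assumed variance, and the video noise level are all assumptions chosen for the example. Both modules assume unit-variance Gaussian noise, so each reports a log-likelihood ratio of 2x, and fusion simply adds the two ratios. When the video noise is actually much larger than assumed, the video ratios grow in magnitude and dominate the sum.

```python
import random


def simulate(n=20000, video_noise_std=10.0, seed=0):
    """Toy two-class task: the label is -1 or +1 and each modality sees label + noise.

    All names and parameter values here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    hits = {"audio": 0, "video": 0, "fused": 0}
    influence = {"audio": 0.0, "video": 0.0}
    for _ in range(n):
        label = rng.choice((-1, 1))
        x_audio = rng.gauss(label, 1.0)              # matches the assumed model
        x_video = rng.gauss(label, video_noise_std)  # may violate the assumed model
        # Both modules ASSUME N(label, 1), so log p(x|+1) - log p(x|-1) = 2x.
        llr = {"audio": 2.0 * x_audio, "video": 2.0 * x_video}
        # Naive fusion: treat the modules as independent and add their ratios.
        llr["fused"] = llr["audio"] + llr["video"]
        for name, value in llr.items():
            hits[name] += (value > 0) == (label > 0)
        # Mean absolute ratio as a rough measure of a module's pull on the sum.
        influence["audio"] += abs(llr["audio"])
        influence["video"] += abs(llr["video"])
    result = {f"{name}_acc": count / n for name, count in hits.items()}
    result.update({f"{name}_influence": total / n for name, total in influence.items()})
    return result


if __name__ == "__main__":
    print("matched context:   ", simulate(video_noise_std=1.0))
    print("mismatched context:", simulate(video_noise_std=10.0))
```

In the matched context the fused accuracy exceeds the audio-only accuracy, as expected. In the mismatched context the fused accuracy falls below the audio-only accuracy while the video module's mean absolute log-likelihood ratio is the larger of the two, which is the pattern the abstract calls catastrophic fusion.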