Noise adaptive stream weighting in audio-visual speech recognition

Authors:
Martin Heckmann; Frédéric Berthommier; Kristian Kroschel
Affiliations:
Institut für Nachrichtentechnik, Universität Karlsruhe, Karlsruhe, Germany; Institut de la Communication Parlée (ICP), Institut National Polytechnique de Grenoble, Grenoble, France; Institut für Nachrichtentechnik, Universität Karlsruhe, Kaiserstraße, Karlsruhe, Germany
Venue:
EURASIP Journal on Applied Signal Processing
Year:
2002

Abstract

Integrating acoustic and visual information has been shown to improve speech recognition results, especially in noisy conditions. This raises the question of how to weight the two modalities under different noise conditions. In this paper, we develop a weighting process that adapts to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. In all cases, the neural networks were trained on clean data. First, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria for estimating the reliability of the audio stream. Based on this comparison, we derive a mapping between the reliability measurements and the free parameter of the fusion process and demonstrate its applicability. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
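
The abstract does not state the fusion rule or the reliability criteria, so the following Python sketch is only an illustration of the general idea, under stated assumptions. It assumes a geometric (exponent) weighting of per-class posteriors with one free parameter `lam`, uses normalized posterior entropy as a stand-in reliability criterion for the audio stream, and uses a hypothetical linear mapping from that entropy to `lam`. All function names and the mapping are illustrative and are not taken from the paper.

```python
import math

EPS = 1e-12  # guards against log(0)


def fuse_posteriors(audio_post, video_post, lam):
    """Weighted geometric combination of two posterior vectors.

    Assumed rule: P(q) is proportional to P_a(q)**lam * P_v(q)**(1 - lam).
    lam = 1.0 trusts audio only, lam = 0.0 trusts video only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    log_scores = [
        lam * math.log(a + EPS) + (1.0 - lam) * math.log(v + EPS)
        for a, v in zip(audio_post, video_post)
    ]
    # Subtract the maximum before exponentiating for numerical stability,
    # then renormalize so the fused scores sum to one.
    peak = max(log_scores)
    unnorm = [math.exp(s - peak) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def normalized_entropy(post):
    """Entropy of a posterior vector divided by log(K).

    Values near 0 mean a confident (peaked) output, values near 1 a flat one.
    """
    h = -sum(p * math.log(p + EPS) for p in post)
    return h / math.log(len(post))


def entropy_to_weight(h_norm):
    """Hypothetical linear mapping: flat audio posteriors give a low audio weight."""
    return min(1.0, max(0.0, 1.0 - h_norm))


# Usage: nearly flat audio posteriors (as might occur in heavy noise)
# lead to a small lam, so the video stream dominates the fused decision.
audio = [0.34, 0.33, 0.33]
video = [0.10, 0.70, 0.20]
lam = entropy_to_weight(normalized_entropy(audio))
fused = fuse_posteriors(audio, video, lam)
best_class = max(range(len(fused)), key=fused.__getitem__)
```

In a real system the mapping from the reliability measurement to the weight would be fitted on held-out data at several noise levels rather than fixed by hand as it is here.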