Adaptive fusion of acoustic and visual sources for automatic speech recognition
Speech Communication - Special issue on auditory-visual speech processing
Integrating acoustic and visual information has been shown to improve speech recognition results, especially in noisy conditions. This raises the question of how to weight the two modalities under different noise conditions. In this paper we develop a weighting process that adapts to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. In all cases, the neural networks were trained on clean data. First, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria for estimating the reliability of the audio stream. Based on these results, we derive a mapping between the reliability measurements and the free parameter of the fusion process and demonstrate its applicability. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
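The abstract does not state the fusion rule or the reliability criteria, so the following is only an illustrative sketch of how a single free parameter can weight two separately computed streams. It assumes a log-linear (geometric) combination of per-class posteriors and uses the normalized entropy of the audio posteriors as one possible reliability measure. The function names, the entropy criterion, and the linear mapping from entropy to weight are all hypothetical and are not taken from the paper.

```python
import math

EPS = 1e-12  # floor to avoid log(0); an implementation detail of this sketch


def fuse_posteriors(p_audio, p_video, lam):
    """Combine two posterior vectors as p(c) ~ p_audio(c)^lam * p_video(c)^(1 - lam).

    lam = 1 uses only the audio stream, lam = 0 only the video stream.
    The log-linear form is an assumption, not the paper's stated rule.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(p_audio) != len(p_video):
        raise ValueError("both streams must score the same classes")
    log_scores = [
        lam * math.log(max(a, EPS)) + (1.0 - lam) * math.log(max(v, EPS))
        for a, v in zip(p_audio, p_video)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def normalized_entropy(p):
    """Entropy of p divided by its maximum, log(len(p)), giving a value in [0, 1]."""
    h = -sum(x * math.log(max(x, EPS)) for x in p)
    return h / math.log(len(p))


def entropy_to_weight(h_norm):
    """Hypothetical linear mapping: confident (low-entropy) audio gets a high weight."""
    return min(1.0, max(0.0, 1.0 - h_norm))


if __name__ == "__main__":
    p_audio = [0.40, 0.35, 0.25]  # flat audio posteriors, as might occur in noise
    p_video = [0.10, 0.80, 0.10]  # made-up values for illustration only
    lam = entropy_to_weight(normalized_entropy(p_audio))
    print("audio weight:", lam)
    print("fused posteriors:", fuse_posteriors(p_audio, p_video, lam))
```

In this sketch the weight is recomputed from the audio posteriors alone, so it needs no knowledge of the noise type. Whether such a simple mapping is adequate is exactly the kind of question the paper's comparison of reliability criteria addresses.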