Audio-Visual Speaker Recognition for Video Broadcast News

Authors:
Benoît Maison;Chalapathy Neti;Andrew Senior
Affiliations:
IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA;IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA;IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA
Venue:
Journal of VLSI Signal Processing Systems
Year:
2001

Abstract

Audio-based speaker identification degrades severely when training and test conditions are mismatched, whether because of channel differences or noise. In this paper, we explore techniques for combining video-based speaker identification with audio-based speaker identification to improve performance under mismatched conditions. Specifically, we explore techniques for determining the relative weights of the independent audio and video decisions so as to achieve the best combination. Experiments on video broadcast news data show that fusion yields significant improvements in acoustically degraded conditions.
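
The abstract does not spell out the fusion rule, so the following is only an illustrative sketch of one common form of weighted decision fusion, not the authors' method. It assumes each modality produces one score per enrolled speaker, combines them with a single audio weight in [0, 1], and picks that weight by a simple search over held-out examples. The function names (`fuse_scores`, `identify`, `pick_weight`) and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_scores(audio: Sequence[float], video: Sequence[float],
                weight: float) -> List[float]:
    """Linear fusion: weight on audio, (1 - weight) on video (assumed form)."""
    return [weight * a + (1.0 - weight) * v for a, v in zip(audio, video)]


def identify(audio: Sequence[float], video: Sequence[float],
             weight: float) -> int:
    """Return the index of the speaker with the highest fused score."""
    fused = fuse_scores(audio, video, weight)
    return max(range(len(fused)), key=fused.__getitem__)


def pick_weight(held_out: Sequence[Tuple[Sequence[float], Sequence[float], int]],
                candidates: Sequence[float]) -> float:
    """Choose the candidate weight with the best held-out accuracy.

    held_out holds (audio_scores, video_scores, true_speaker_index) triples.
    Ties go to the earliest candidate in the list.
    """
    def accuracy(w: float) -> float:
        hits = sum(identify(a, v, w) == y for a, v, y in held_out)
        return hits / len(held_out)

    return max(candidates, key=accuracy)


# Toy held-out set (made-up numbers): the audio scores are unreliable here,
# so the search should favour the video decision.
held_out = [
    ([0.6, 0.4], [0.1, 0.9], 1),
    ([0.5, 0.5], [0.8, 0.2], 0),
]
best = pick_weight(held_out, candidates=[1.0, 0.0])
```

In this sketch a weight of 1.0 trusts audio alone and 0.0 trusts video alone; in acoustically degraded conditions a held-out search of this kind would be expected to shift the weight toward video.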