Adaptive speaker identification with audiovisual cues for movie content analysis

  • Authors:
  • Ying Li;Shrikanth S. Narayanan;C.-C. Jay Kuo

  • Affiliations:
  • Integrated Media Systems Center, Department of Electrical Engineering, University of Southern California, Los Angeles, CA (all authors)

  • Venue:
  • Pattern Recognition Letters - Video computing
  • Year:
  • 2004


Abstract

An adaptive speaker identification system that employs both audio and visual cues is proposed in this work for movie content analysis. Specifically, a likelihood-based approach is first applied to speaker identification using pure speech data, while techniques such as face detection/recognition and mouth tracking are applied to talking-face recognition using pure visual data. These two information cues are then integrated under a probabilistic framework to achieve more robust results. Moreover, to account for speakers' voice variations over time, we propose updating their acoustic models on the fly by adapting to their incoming speech data. Improved system performance (80% identification accuracy) was observed on two test movies.
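
The two core ideas in the abstract — fusing per-speaker audio and visual scores under one framework, and drifting a speaker's acoustic model toward incoming speech — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fusion weights, score scales, and the simple moving-average adaptation below are all assumptions standing in for the paper's likelihood-based models and on-the-fly adaptation.

```python
def fuse_scores(audio_loglik, face_score, w_audio=0.6, w_visual=0.4):
    """Combine per-speaker audio log-likelihoods with talking-face
    scores via a weighted sum and return the best-scoring speaker.
    Weights are illustrative, not from the paper."""
    fused = {
        s: w_audio * audio_loglik[s] + w_visual * face_score.get(s, 0.0)
        for s in audio_loglik
    }
    return max(fused, key=fused.get)


def adapt_mean(old_mean, new_samples, alpha=0.1):
    """Shift a speaker's acoustic-model mean toward newly observed
    speech features -- a crude stand-in for on-the-fly model
    adaptation that tracks voice variation over time."""
    sample_mean = sum(new_samples) / len(new_samples)
    return (1 - alpha) * old_mean + alpha * sample_mean
```

For example, a speaker with a weaker audio match can still win when the visual evidence (a detected talking face) strongly favors them, which is the robustness the probabilistic fusion aims for.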