Audiovisual diarization of people in video content

Authors:
Elie El Khoury;Christine Sénac;Philippe Joly
Affiliations:
Idiap Research Institute, Martigny, Switzerland and Laboratoire d'Informatique de l'Université du Maine, Le Mans, France;Institut de Recherche en Informatique de Toulouse, Toulouse, France;Institut de Recherche en Informatique de Toulouse, Toulouse, France
Venue:
Multimedia Tools and Applications
Year:
2014


Abstract

Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. We first review the literature on people diarization in audio and in video content, including our own contributions, and discuss the limitations of existing approaches. We then describe a method that associates audio and video information through co-occurrence matrices, and we report experiments on a corpus of TV news, TV debates, and movies. The results show the effectiveness of the overall diarization system and confirm the gains that audio information can bring to video indexing and vice versa.
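
To make the co-occurrence idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that audio diarization yields speaker segments and video diarization yields face segments, each as hypothetical (label, start, end) tuples in seconds, and that co-occurrence is measured as the total time a speaker cluster and a face cluster overlap. The function names, the segment format, and the "pick the face with the largest overlap" rule are all illustrative assumptions; the abstract does not specify them.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Return the duration (in seconds) shared by two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cooccurrence(audio_segments, video_segments):
    """Accumulate overlap time for every (speaker, face) pair.

    Both inputs are lists of (label, start, end) tuples. This segment
    format is an assumption made for illustration only.
    """
    matrix = defaultdict(float)
    for speaker, a_start, a_end in audio_segments:
        for face, v_start, v_end in video_segments:
            shared = overlap(a_start, a_end, v_start, v_end)
            if shared > 0:
                matrix[(speaker, face)] += shared
    return dict(matrix)


def best_face_per_speaker(matrix):
    """Associate each speaker with the face it co-occurs with longest.

    This greedy rule is a simplification chosen for the sketch.
    """
    best = {}
    for (speaker, face), duration in matrix.items():
        if speaker not in best or duration > best[speaker][1]:
            best[speaker] = (face, duration)
    return {speaker: face for speaker, (face, _) in best.items()}


# Toy data (invented for the example, not taken from the paper's corpus).
audio = [("spk_A", 0.0, 10.0), ("spk_B", 10.0, 18.0), ("spk_A", 18.0, 25.0)]
video = [("face_1", 0.0, 9.0), ("face_2", 9.0, 20.0), ("face_1", 20.0, 25.0)]

matrix = cooccurrence(audio, video)
association = best_face_per_speaker(matrix)
print(matrix)
print(association)
```

In this toy example the speaker spk_A overlaps mostly with face_1 and spk_B only with face_2, so the greedy rule pairs them accordingly. A real system would also have to handle off-screen speakers and on-screen listeners, which a pure overlap count cannot distinguish.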