A Graph Based Approach to Speaker Retrieval in Talk Show Videos with Transcript-Based Supervision

Authors:
Yina Han;Guizhong Liu;Hichem Sahbi;Gérard Chollet
Affiliations:
The School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China 710049;The School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China 710049;TELECOM-ParisTech, CNRS LTCI UMR 5141, Paris, France 75634;TELECOM-ParisTech, CNRS LTCI UMR 5141, Paris, France 75634
Venue:
PCM '09 Proceedings of the 10th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing
Year:
2009

Abstract

This paper proposes a graph-based strategy for retrieving frames that contain queried speakers in talk show videos. Using the "who is speaking and when" information from the audio transcript, an initial audio-based step, which restricts the search for the queried person to the frames in which he or she is speaking, is combined with a second step that analyzes the visual features of shots. Specifically, based on the production properties of talk show video: (1) a shot-based graph is constructed first, and its densest sub-graph is returned as the final result; (2) instead of a direct search (DS) for the densest part, the intra-node and inter-node connections are modeled by a frame-layer degree map, which takes into account the duration information within each shot node; (3) a graph partition strategy that places no restriction on the shape or the number of sub-graphs is proposed, in which shots containing the same person are more similar to each other. Experiments on one episode of the French talk show "Le Grand Echiquier" show, on average, an improvement of more than 10% over the audio-only method and of more than 7.5% over the DS method.
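
To make the idea of searching a shot graph for its densest part concrete, here is a minimal sketch of a direct densest-sub-graph search. It assumes the standard greedy peeling heuristic (repeatedly remove the node of lowest weighted degree and keep the densest intermediate node set), which may differ from the procedure used in the paper. It does not implement the frame-layer degree map or the graph partition strategy. The function name, the shot labels, and the similarity weights are hypothetical and chosen only for illustration.

```python
def densest_subgraph(adj):
    """Greedy peeling on a weighted, undirected graph.

    adj maps each node to a dict {neighbor: similarity weight} and is
    assumed to be symmetric. Density is the total edge weight of the
    remaining nodes divided by the number of remaining nodes.
    """
    nodes = set(adj)
    # Weighted degree of every node with respect to the remaining nodes.
    degree = {u: sum(adj[u].values()) for u in nodes}
    total = sum(degree.values()) / 2.0  # each edge is counted twice
    best_density, best_nodes = -1.0, set()
    while nodes:
        density = total / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        # Peel off the node that is least connected to the rest.
        u = min(nodes, key=lambda n: (degree[n], n))
        nodes.remove(u)
        for v, w in adj[u].items():
            if v in nodes:
                degree[v] -= w
                total -= w
    return best_nodes, best_density


# Hypothetical shot graph: s1, s2, s3 are visually similar shots,
# s4 and s5 are weakly connected to the rest.
example_graph = {
    "s1": {"s2": 0.9, "s3": 0.8},
    "s2": {"s1": 0.9, "s3": 0.9},
    "s3": {"s1": 0.8, "s2": 0.9, "s4": 0.1},
    "s4": {"s3": 0.1, "s5": 0.2},
    "s5": {"s4": 0.2},
}
dense_shots, dense_score = densest_subgraph(example_graph)
```

In this toy graph the three mutually similar shots form the densest part, which is the kind of group a retrieval step would return as shots likely to show the same person.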