Assembling personal speech collections by monologue scene detection from a news video archive

Authors:
Ichiro Ide;Naoki Sekioka;Tomokazu Takahashi;Hiroshi Murase
Affiliations:
Nagoya University, Nagoya, Japan and National Institute of Informatics;Nagoya University, Nagoya, Japan;Nagoya University, Japan;Nagoya University, Japan
Venue:
MIR '06 Proceedings of the 8th ACM international workshop on Multimedia information retrieval
Year:
2006

Abstract

Monologue scenes in news shows are important because they carry non-verbal information that cannot be expressed through text media. In this paper, we propose a method that detects monologue scenes of individuals who appear in news shows (news subjects) without external or prior knowledge about the show. The method first detects monologue scene candidates by face detection in the frame images, and then excludes scenes that overlap with speech by anchor-persons or reporters (news persons), who are modeled dynamically using clues obtained from the closed-caption text and from the audio stream. As an application of monologue scene detection, we also propose a method that assembles a personal speech collection for each individual who appears in the news. Although the methods still need further improvement for realistic use, we confirmed the effectiveness of employing multimodal information for these tasks, and also observed interesting outputs from the automatically assembled speech collections.
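The two-stage idea in the abstract (face-based candidates, then exclusion of scenes that overlap news-person speech) can be illustrated with a minimal sketch. It is not the authors' implementation. The `Segment` type, the function names, the "any temporal overlap" exclusion rule, and the assumption that each candidate already carries a person label are all hypothetical, chosen only to make the flow concrete. The abstract does not specify how faces are detected, how news persons are modeled, or how individuals are identified.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A time span in a show, in seconds. `label` is a hypothetical person tag."""
    start: float
    end: float
    label: str = ""


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two time spans share any interval of positive length."""
    return a.start < b.end and b.start < a.end


def detect_monologues(face_candidates, news_person_speech):
    """Keep candidates that do not overlap any news-person speech span.

    Assumption: any overlap disqualifies a candidate; the real criterion
    used in the paper may differ.
    """
    return [c for c in face_candidates
            if not any(overlaps(c, s) for s in news_person_speech)]


def assemble_collections(monologues):
    """Group surviving monologue scenes by their (assumed) person label."""
    collections = defaultdict(list)
    for m in monologues:
        collections[m.label].append(m)
    return dict(collections)


# Made-up example data: face-detected candidates and news-person speech spans.
candidates = [
    Segment(0, 10, "studio shot"),
    Segment(12, 20, "Person A"),
    Segment(25, 30, "Person B"),
    Segment(40, 48, "Person A"),
]
news_speech = [Segment(0, 11), Segment(28, 35)]

monologues = detect_monologues(candidates, news_speech)
collections = assemble_collections(monologues)
```

In this toy input, the studio shot and the "Person B" scene are dropped because they overlap news-person speech, leaving two scenes grouped under "Person A".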