Automatic retrieval of visual continuity errors in movies
Proceedings of the ACM International Conference on Image and Video Retrieval
Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales "in the wild". Automatically harvesting labeled action sequences from such video would enable the creation of large-scale, highly varied datasets. To enable such collection, we focus on recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels, and shots are reordered into long continuous tracks, or threads, which allow more accurate tracking of people, actions and objects. Scene segmentation, alignment, and shot threading are formulated as inference in a unified generative model, and we present a novel hierarchical dynamic programming algorithm that handles alignment and jump-limited reorderings in linear time. We report quantitative and qualitative results on movie alignment and parsing, and use the recovered structure to improve character naming and the retrieval of common actions in several episodes of popular TV series.
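To make the alignment idea concrete, the sketch below shows a minimal monotone dynamic program that assigns each shot to a screenplay scene so that scene indices never decrease, maximizing a similarity score. This is an illustrative simplification, not the paper's hierarchical algorithm: the function and variable names (`align_scenes_to_shots`, `similarity`) are hypothetical, and `similarity` stands in for whatever dialogue/closed-caption match score is actually used.

```python
def align_scenes_to_shots(scenes, shots, similarity):
    """Illustrative sketch (not the paper's algorithm): label each shot with a
    scene index such that labels are non-decreasing over time, maximizing the
    total similarity. Runs in O(len(scenes) * len(shots)) time."""
    n, m = len(scenes), len(shots)
    NEG = float("-inf")
    # dp[i][j]: best score for shots[0..j] with shot j assigned to scene i.
    dp = [[NEG] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    dp[0][0] = similarity(scenes[0], shots[0])
    for j in range(1, m):
        for i in range(n):
            s = similarity(scenes[i], shots[j])
            stay = dp[i][j - 1]                      # shot j-1 in the same scene
            advance = dp[i - 1][j - 1] if i else NEG  # scene boundary before shot j
            if stay >= advance and stay > NEG:
                dp[i][j], back[i][j] = stay + s, i
            elif advance > NEG:
                dp[i][j], back[i][j] = advance + s, i - 1
    # Trace back the scene label of every shot from the best final state.
    i = max(range(n), key=lambda k: dp[k][m - 1])
    labels = [0] * m
    for j in range(m - 1, -1, -1):
        labels[j] = i
        if j > 0:
            i = back[i][j]
    return labels
```

The paper's actual model is richer (it jointly handles segmentation, alignment, and jump-limited shot reordering), but the same non-decreasing-assignment constraint is what makes the alignment step amenable to dynamic programming.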