A Hidden Markov Model Approach to the Structure of Documentaries

Authors:
Tiecheng Liu;John R. Kender
Venue:
CBAIVL '00 Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries (CBAIVL'00)
Year:
2000

Citing 0
Cited 7

Applications of Video-Content Analysis and Retrieval
IEEE MultiMedia

Hierarchical topical segmentation in instructional films based on cinematic expressive functions
MULTIMEDIA '03 Proceedings of the eleventh ACM international conference on Multimedia

Fine-grained hidden markov modeling for broadcast-news story segmentation
HLT '01 Proceedings of the first international conference on Human language technology research

Associating characters with events in films
Proceedings of the 6th ACM international conference on Image and video retrieval

On supervision and statistical learning for semantic multimedia analysis
Journal of Visual Communication and Image Representation

Spatial-temporal semantic grouping of instructional video content
CIVR'03 Proceedings of the 2nd international conference on Image and video retrieval

P2P video synchronization in a collaborative virtual environment
ICWL'05 Proceedings of the 4th international conference on Advances in Web-Based Learning

Abstract

We have hand-segmented two very long documentaries (100 minutes total) into their component shots. As with other extended videos, the distribution of shot lengths again appears to be lognormal. Shot lengths are similar to those in dramas, comedies, or action films, but much shorter than those in home videos. Fades appear to be an important device for signaling transitions between semantic units. We have sought evidence for shot composition rules by means of Hidden Markov Models (HMMs). We find that camera motion (tilt, pan, zoom) is not significantly governed by rules. However, the bulk of each documentary takes the form of an alternation between commentators and several types of primary supporting material; additionally, the documentaries end with a visual summary. We find that the best approach is to train the HMM with labeled subsequences of approximately equal elapsed time, rather than subsequences with an equal number of shots or subsequences with shots aligned to some semantic event. This may reflect fundamental temporal limits on human visual attention. We propose that such an underlying structure can suggest more human-sensitive designs for the analysis and graphic display of the contents of extended videos, for summarization, browsing, and indexing.
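The abstract contrasts training subsequences of approximately equal elapsed time with subsequences of an equal number of shots. The following is a minimal sketch of that segmentation step only, not the authors' implementation: the function names, the shot labels, the durations, and the 20-second window are all hypothetical, and the HMM training itself is left out. A small helper also shows how the parameters of a lognormal shot-length distribution could be estimated from the logarithms of the durations.

```python
from math import log
from statistics import mean, stdev


def split_by_elapsed_time(shots, window_seconds):
    """Group (label, duration) shots into runs of roughly window_seconds each.

    A run is closed as soon as its accumulated duration reaches the window,
    so runs have similar elapsed time but may differ in shot count.
    """
    chunks, current, elapsed = [], [], 0.0
    for label, duration in shots:
        current.append((label, duration))
        elapsed += duration
        if elapsed >= window_seconds:
            chunks.append(current)
            current, elapsed = [], 0.0
    if current:  # keep the trailing partial run
        chunks.append(current)
    return chunks


def split_by_shot_count(shots, shots_per_chunk):
    """Baseline: group shots into runs with an equal number of shots."""
    return [shots[i:i + shots_per_chunk]
            for i in range(0, len(shots), shots_per_chunk)]


def lognormal_parameters(durations):
    """Estimate (mu, sigma) of a lognormal fit from the log-durations."""
    logs = [log(d) for d in durations]
    return mean(logs), stdev(logs)


# Hypothetical labeled shots: (shot type, duration in seconds).
shots = [
    ("commentator", 18.0), ("archival", 3.0), ("location", 2.0),
    ("archival", 2.0), ("commentator", 20.0), ("location", 4.0),
    ("archival", 3.0), ("summary", 8.0),
]

time_chunks = split_by_elapsed_time(shots, 20.0)
count_chunks = split_by_shot_count(shots, 3)

for chunk in time_chunks:
    print([label for label, _ in chunk], sum(d for _, d in chunk))
```

Under these assumptions, each run produced by `split_by_elapsed_time` would become one labeled training sequence for the HMM, while `split_by_shot_count` gives the equal-shot-count alternative for comparison.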