Video summaries and cross-referencing through mosaic-based representation

  • Authors: Aya Aner-Wolf; John R. Kender

  • Affiliations: Department of Math and Computer Science, The Weizmann Institute of Science, Rehovot 76100, Israel; Department of Computer Science, Columbia University, New York, NY

  • Venue: Computer Vision and Image Understanding
  • Year: 2004

Abstract

We present an approach for compact video summaries that allows fast and direct access to video data. The video is segmented into shots and, in appropriate video genres, into scenes, using previously proposed methods. We present a new concept that supports a hierarchical representation of video, based on physical settings and camera locations. We use mosaics to represent shots and then scenes, and we compare mosaics with a novel method that is robust against changes in viewpoint and illumination. In contrast to approaches to video indexing that rely on a frame-based representation, our efficient mosaic-based representation allows fast clustering of scenes into physical settings, a new conceptual form grounded in the recognition of real-world backgrounds. We employ a technique for choosing representative mosaics for each physical setting, for a more compact representation and faster comparison between settings. This compact representation and comparison method runs in real time and allows fast and accurate summaries and comparison of scenes across different videos, and serves as a basis for indexing video libraries. We demonstrate our work using situation comedies (sitcoms), where each half-hour episode is well structured by rules governing background use. Consequently, browsing, indexing, and comparison across videos by physical setting are very fast. Further, we show that physical settings lead to a higher-level contextual identification of the main plots in each video. We demonstrate these contributions with a browsing tool whose top-level single page displays the settings of several episodes. In sports videos, where settings are not as well defined, our approach allows classifying shots for characteristic event detection.
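
The abstract does not specify how scenes are grouped into physical settings or how representative mosaics are selected, so the following is only an illustrative sketch, not the authors' algorithm. It assumes that some mosaic comparison step has already produced a symmetric dissimilarity matrix between scene mosaics. The function names `cluster_by_threshold` and `representative`, the threshold value, and the toy matrix are all hypothetical. Scenes are grouped by simple single-linkage thresholding, and each group's representative is taken to be its medoid (the member with the smallest total dissimilarity to the others).

```python
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def cluster_by_threshold(dist: Matrix, threshold: float) -> List[List[int]]:
    """Group scene indices so that any pair closer than `threshold` shares a group.

    This is single-linkage clustering via union-find; it is an assumed
    stand-in for whatever clustering the paper actually uses.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def representative(dist: Matrix, members: Sequence[int]) -> int:
    """Return the medoid: the member with the smallest summed distance to the rest."""
    return min(members, key=lambda m: sum(dist[m][o] for o in members))


# Toy dissimilarities between five scene mosaics (made-up values).
toy_dist = [
    [0.00, 0.10, 0.90, 0.80, 0.20],
    [0.10, 0.00, 0.85, 0.90, 0.15],
    [0.90, 0.85, 0.00, 0.10, 0.95],
    [0.80, 0.90, 0.10, 0.00, 0.90],
    [0.20, 0.15, 0.95, 0.90, 0.00],
]

settings = cluster_by_threshold(toy_dist, threshold=0.3)
reps = [representative(toy_dist, members) for members in settings]
print(settings, reps)
```

On the toy matrix, scenes 0, 1, and 4 fall into one setting and scenes 2 and 3 into another. Comparing two episodes would then only require comparing the representatives of their settings rather than every pair of scenes, which is the kind of saving the abstract attributes to representative mosaics.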