Sports video processing for description, summarization and search

  • Authors:
  • Ahmet Ekin; A. Murat Tekalp


  • Year:
  • 2004

Abstract

This thesis proposes solutions for structural and semantic video modeling, automatic video analysis, and expressive video search and retrieval. We present a structural-semantic video model for the effective representation of high- and low-level video information; an automatic, multi-modal sports video processing framework for instantiation of the model attributes and for summarization; and a graph-based query formation and resolution framework for semantic search and retrieval based on the proposed model. Except for the video analysis algorithms, which are specific to sports video, the proposed structural-semantic video model and the graph-based querying framework are generic: they apply to the description and querying of any type of video.

We first introduce a structural-semantic video model for the efficient description of high-level and low-level video features. The proposed model unifies the shot-based and object-based structural video models employed by the video processing and computer vision communities with the entity-relationship (ER) and object-oriented models used by the database and information retrieval communities. This unified approach improves on the existing MPEG-7 approach, which uses two separate description schemes (DS) for the same task.

To instantiate the model descriptors and generate automatic, real-time video summaries, we focus on the domain of sports video, because extracting high-level model entities from low-level video features requires a specified domain. We propose a multi-modal, scalable sports video processing framework for model descriptor instantiation and fast summarization of broadcast sports video. The framework is multi-modal because it employs visual, audio, and text features, and scalable because the system can generate descriptors in real time or offline according to user preferences and requirements. It is also applicable to multiple types of sports. The scalability of the framework results from classifying visual features into cinematic and object-based features and processing them efficiently: because cinematic features, such as shot boundaries, shot types, and slow-motion replays, are cheaper to compute, we extract them before the object-based analysis that involves object detection and tracking. Real-time descriptors and summaries are computed using only cinematic visual features together with some audio and text features.

Because some cinematic and object-based algorithms use features extracted from the field region, and most sporting events take place on a field with one distinct dominant color, we develop a robust low-level dominant color region detection algorithm that automatically detects the color of the field and adapts to variations caused by changing imaging conditions. (Abstract shortened by UMI.)
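
As a concrete illustration of the unified structural-semantic model described above, the following is a minimal sketch in Python. All class names, fields, and example entities are hypothetical, chosen only to show how shot-based structure, object-based structure, and ER-style relations can coexist in one description; the thesis's actual schema may differ.

```python
# Hypothetical sketch of a unified structural-semantic video model:
# shot-based structure + object-based structure + ER-style relations.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Shot:                        # structural unit: contiguous frame range
    shot_id: str
    start_frame: int
    end_frame: int
    shot_type: str = "unknown"     # cinematic feature, e.g. "long", "close-up"
    is_slow_motion: bool = False   # cinematic feature


@dataclass
class VideoObject:                 # object-based structural entity
    object_id: str
    label: str                     # e.g. "player", "referee", "ball"
    low_level: Dict[str, object] = field(default_factory=dict)  # color, trajectory, ...


@dataclass
class Event:                       # high-level semantic entity
    event_id: str
    name: str                      # e.g. "goal", "free kick"
    shots: List[str] = field(default_factory=list)  # structural grounding


@dataclass
class Relation:                    # ER-style typed link between entities
    rel_type: str                  # e.g. "actor-of", "occurs-in"
    source: str
    target: str


@dataclass
class VideoDescription:            # one unified description of a video
    shots: List[Shot] = field(default_factory=list)
    objects: List[VideoObject] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


# Example: "player P7 is the actor of goal event E1, which occurs in shot S42"
desc = VideoDescription(
    shots=[Shot("S42", 10_500, 11_200, shot_type="long")],
    objects=[VideoObject("P7", "player")],
    events=[Event("E1", "goal", shots=["S42"])],
    relations=[Relation("actor-of", "P7", "E1"),
               Relation("occurs-in", "E1", "S42")],
)
```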
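The graph-based querying idea can likewise be sketched: if the description is a labeled entity graph, a query is a small pattern graph resolved by subgraph matching. The sketch below uses networkx for the matching; the attribute names, the `?x`/`?e` variable convention, and the query itself are illustrative assumptions, not the thesis's actual query language.

```python
# Sketch of graph-based query resolution via subgraph matching (networkx).
import networkx as nx
from networkx.algorithms import isomorphism

# Description graph built from the model above: entities are nodes,
# ER relations are typed edges.
G = nx.DiGraph()
G.add_node("P7", kind="player")
G.add_node("E1", kind="goal")
G.add_node("S42", kind="shot")
G.add_edge("P7", "E1", rel="actor-of")
G.add_edge("E1", "S42", rel="occurs-in")

# Query graph: an unknown player ?x that is the actor of some goal event ?e.
Q = nx.DiGraph()
Q.add_node("?x", kind="player")
Q.add_node("?e", kind="goal")
Q.add_edge("?x", "?e", rel="actor-of")

matcher = isomorphism.DiGraphMatcher(
    G, Q,
    node_match=lambda a, b: a["kind"] == b["kind"],
    edge_match=lambda a, b: a["rel"] == b["rel"],
)
for mapping in matcher.subgraph_isomorphisms_iter():
    # mapping sends description-graph nodes to query variables; invert it
    bindings = {v: k for k, v in mapping.items()}
    print("?x =", bindings["?x"])   # -> ?x = P7
```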
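The cinematic-before-object ordering can be made concrete with a toy pipeline: a cheap histogram-based shot-boundary detector stands in for the real-time cinematic pass, and expensive object-based analysis is deferred to an optional offline pass over selected shots. The input file name, the threshold, and the detector choice are assumptions for illustration, not the thesis's algorithms.

```python
# Sketch of the cinematic-first, scalable ordering: cheap cinematic features
# run in real time; object-based analysis is deferred to an offline pass.
import cv2

def shot_boundaries(video_path, threshold=0.5):
    """Yield frame indices where the HSV histogram changes abruptly."""
    cap = cv2.VideoCapture(video_path)
    prev_hist, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # low correlation between consecutive frame histograms => cut
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                yield idx
        prev_hist, idx = hist, idx + 1
    cap.release()

# Real-time pass: cinematic features only.
cuts = list(shot_boundaries("match.mp4"))        # hypothetical input file
shots = list(zip([0] + cuts, cuts + [None]))     # (start, end) frame ranges

# Offline pass (optional, per user preference): object detection and
# tracking would run here, restricted to shots selected by the cinematic
# features, e.g. those flagged as slow-motion replays.
```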
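Finally, the adaptive dominant color idea can be sketched as a hue-histogram peak tracker that slowly updates its field-color estimate so the model follows lighting changes. The tolerance and adaptation rate below are illustrative choices; the thesis's actual algorithm may differ.

```python
# Sketch of adaptive dominant (field) color detection: estimate the
# dominant hue from a histogram peak, segment pixels within a tolerance
# of it, and adapt the estimate over time to track imaging conditions.
import cv2
import numpy as np

class DominantColorDetector:
    def __init__(self, hue_tol=10, alpha=0.05):
        self.hue_tol = hue_tol       # half-width of the accepted hue band
        self.alpha = alpha           # adaptation rate of the running estimate
        self.dominant_hue = None

    def field_mask(self, frame_bgr):
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        hue = hsv[:, :, 0].astype(np.float32)
        # This frame's dominant hue = peak of the hue histogram.
        hist = cv2.calcHist([hsv], [0], None, [180], [0, 180]).ravel()
        peak = float(np.argmax(hist))
        if self.dominant_hue is None:
            self.dominant_hue = peak
        else:
            # Slowly adapt to changing imaging conditions.
            self.dominant_hue = (1 - self.alpha) * self.dominant_hue + self.alpha * peak
        lo = self.dominant_hue - self.hue_tol
        hi = self.dominant_hue + self.hue_tol
        # Note: ignores hue wraparound, which is acceptable for green fields.
        return ((hue >= lo) & (hue <= hi)).astype(np.uint8) * 255
```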