Towards extracting semantically meaningful key frames from personal video clips: from humans to computers

  • Authors:
  • Jiebo Luo; Christophe Papin; Kathleen Costello

  • Affiliations:
  • Kodak Research Laboratories, Eastman Kodak Company, Rochester, NY; Multimedia Processing Group of Thales Communications France, Colombes Cedex, France and Kodak Research Laboratories, Eastman Kodak Company, Rochester, NY; Kodak Research Laboratories, Eastman Kodak Company, Rochester, NY

  • Venue:
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Year:
  • 2009

Abstract

Extracting key frames from video is of great interest in many applications, such as video summarization, video organization, video compression, and making prints from video. Key frame extraction is not a new problem, but the existing literature has focused primarily on sports or news video. In the personal or consumer video space, the biggest challenges for key frame selection are the unconstrained content and the lack of any pre-imposed structure. First, in a psychovisual study, we collect ground-truth key frames from video clips taken by digital cameras (as opposed to camcorders) using both first- and third-party judges. The goals of this study are to: 1) create a reference database of video clips reasonably representative of the consumer video space; 2) identify consensus key frames against which automated algorithms can be compared and judged for effectiveness, i.e., ground truth; and 3) uncover the criteria used by both first- and third-party human judges so these criteria can influence algorithm design. Next, we develop an automatic key frame extraction method dedicated to summarizing consumer video clips acquired from digital cameras. Analysis of spatio-temporal changes over time provides semantically meaningful information about the scene and the camera operator's general intent. In particular, camera and object motion are estimated and used to derive motion descriptors. A video clip is segmented into homogeneous parts based on the major types of camera motion (e.g., pan, zoom, pause, steady). Dedicated rules are used to extract candidate key frames from each segment. In addition, confidence measures are computed for the candidates to enable ranking by semantic relevance. This method is scalable, so one can produce any desired number of key frames from the candidates. Finally, we demonstrate the effectiveness of our method by comparing its results with those of two alternative methods against the ground truth agreed upon by multiple judges.
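The pipeline described in the abstract — classify per-frame camera motion, segment the clip into homogeneous runs, pick one candidate key frame per segment with a confidence score, and keep the top-k — can be illustrated with a minimal sketch. The paper's actual motion estimation, rules, and confidence measures are not given here, so the descriptor names, thresholds, and scoring weights below are all illustrative assumptions:

```python
# Hedged sketch of the segment-then-rank key frame pipeline.
# Inputs are assumed pre-computed per-frame motion descriptors
# (pan magnitude and zoom factor); real systems would estimate
# these from the video via global motion estimation.

def label_frame(pan, zoom, pan_th=0.5, zoom_th=0.05):
    """Classify a frame's dominant camera motion (illustrative thresholds)."""
    if abs(zoom) > zoom_th:
        return "zoom"
    if abs(pan) > pan_th:
        return "pan"
    return "pause"

def segment_by_motion(pans, zooms):
    """Group consecutive frames with the same motion label into
    homogeneous segments, returned as (label, start, end) tuples."""
    labels = [label_frame(p, z) for p, z in zip(pans, zooms)]
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i - 1))
            start = i
    return segments

def candidate_key_frames(segments):
    """One candidate per segment with an illustrative confidence score:
    pauses rank highest (holding the camera still suggests intent),
    then the end of a zoom, then the midpoint of a pan. Longer
    segments get more confidence, capped at 1.0."""
    weights = {"pause": 1.0, "zoom": 0.7, "pan": 0.4}
    candidates = []
    for label, s, e in segments:
        frame = e if label == "zoom" else (s + e) // 2
        conf = weights[label] * min(1.0, (e - s + 1) / 10.0)
        candidates.append((frame, conf, label))
    return sorted(candidates, key=lambda c: -c[1])

def top_key_frames(candidates, k):
    """Scalable output: keep the k highest-confidence candidates,
    returned in temporal order."""
    return sorted(f for f, _, _ in candidates[:k])
```

For example, a clip whose descriptors show ten frames of panning, ten of pause, and ten of zooming yields three segments and three ranked candidates; asking for two key frames returns the pause and zoom candidates, while asking for one returns only the pause frame — the scalability property the abstract describes.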