Evaluating multimedia features and fusion for example-based event detection

  • Authors:
  • Gregory K. Myers;Ramesh Nallapati;Julien Hout;Stephanie Pancoast;Ramakant Nevatia;Chen Sun;Amirhossein Habibian;Dennis C. Koelma;Koen E. Sande;Arnold W. Smeulders;Cees G. Snoek

  • Affiliations:
  • SRI International (SRI), Menlo Park, USA 94025;SRI International (SRI), Menlo Park, USA 94025 and IBM Thomas J Watson Research Center, Yorktown Heights, USA 10598;SRI International (SRI), Menlo Park, USA 94025;SRI International (SRI), Menlo Park, USA 94025;Institute for Robotics and Intelligent Systems, University of Southern California (USC), Los Angeles, USA 90089-0273;Institute for Robotics and Intelligent Systems, University of Southern California (USC), Los Angeles, USA 90089-0273;University of Amsterdam (UvA), Amsterdam, The Netherlands 1098 GH;University of Amsterdam (UvA), Amsterdam, The Netherlands 1098 GH;University of Amsterdam (UvA), Amsterdam, The Netherlands 1098 GH;University of Amsterdam (UvA), Amsterdam, The Netherlands 1098 GH;University of Amsterdam (UvA), Amsterdam, The Netherlands 1098 GH

  • Venue:
  • Machine Vision and Applications
  • Year:
  • 2014

Abstract

Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME's performance in the 2012 TRECVID MED evaluation was one of the best reported.
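
The arithmetic-mean fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not SESAME's implementation: the function name `fuse_scores_mean`, the classifier labels, and the score values are hypothetical, and the sketch assumes every classifier's detection scores have already been normalized to a comparable range (for example, [0, 1]).

```python
from statistics import fmean


def fuse_scores_mean(scores_by_classifier):
    """Late fusion by arithmetic mean (illustrative sketch only).

    scores_by_classifier maps a classifier name to a dict of
    {clip_id: detection_score}. Scores are assumed to be on a
    comparable scale; that normalization is not done here.
    Only clips scored by every classifier are fused.
    """
    per_classifier = list(scores_by_classifier.values())
    if not per_classifier:
        return {}
    # Keep only the clips that every classifier has scored.
    common_clips = set(per_classifier[0])
    for scores in per_classifier[1:]:
        common_clips &= set(scores)
    # The fused score is the plain average of the per-classifier scores.
    return {
        clip: fmean(scores[clip] for scores in per_classifier)
        for clip in sorted(common_clips)
    }


# Hypothetical scores for two clips from three single-modality classifiers.
example_scores = {
    "visual": {"clip_a": 0.8, "clip_b": 0.2},
    "motion": {"clip_a": 0.6, "clip_b": 0.3},
    "audio": {"clip_a": 0.4, "clip_b": 0.1},
}
fused = fuse_scores_mean(example_scores)
# Rank clips by fused score, highest first.
ranking = sorted(fused, key=fused.get, reverse=True)
```

Averaging has no parameters to train, which is one plausible reason a simple method like this can hold up against more complex fusion schemes when the number of example clips is small.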