Audio-visual grouplet: temporal audio-visual interactions for general video concept classification

  • Authors:
  • Wei Jiang, Alexander C. Loui

  • Affiliations:
  • Eastman Kodak Company, Rochester, NY, USA

  • Venue:
  • MM '11 Proceedings of the 19th ACM international conference on Multimedia
  • Year:
  • 2011


Abstract

We investigate general concept classification in unconstrained videos through joint audio-visual analysis. A novel representation, the Audio-Visual Grouplet (AVG), is extracted by studying statistical temporal audio-visual interactions. An AVG is defined as a set of audio and visual codewords that are grouped together according to their strong temporal correlations in videos. The AVGs carry unique audio-visual cues to represent the video content, from which an audio-visual dictionary can be constructed for concept classification. By using entire AVGs as building elements, the audio-visual dictionary is much more robust than traditional vocabularies built from discrete audio or visual codewords. Specifically, we conduct coarse-level foreground/background separation in both audio and visual channels, and discover four types of AVGs by exploring mixed-and-matched temporal correlations among the following factors: visual foreground, visual background, audio foreground, and audio background. All four types of AVGs provide discriminative audio-visual patterns for classifying various semantic concepts. We extensively evaluate our method on the large-scale Columbia Consumer Video set. Experiments demonstrate that the AVG-based dictionaries achieve consistent and significant performance improvements compared with other state-of-the-art approaches.
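The core idea of grouping audio and visual codewords by strong temporal correlation can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the use of Pearson correlation on per-frame codeword activation series, and the correlation threshold are all assumptions made for illustration.

```python
import numpy as np

def temporal_grouplets(visual_hist, audio_hist, threshold=0.7):
    """Pair audio and visual codewords whose per-frame activation
    series are strongly temporally correlated (illustrative sketch).

    visual_hist: (T, Nv) array, visual codeword activations per frame
    audio_hist:  (T, Na) array, audio codeword activations per frame
    Returns a list of (visual_index, audio_index) pairs.
    """
    T = visual_hist.shape[0]

    def znorm(x):
        # z-normalize each codeword's time series so the dot product
        # below equals the Pearson correlation coefficient
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    v = znorm(visual_hist.astype(float))
    a = znorm(audio_hist.astype(float))

    # (Nv, Na) matrix of correlations between every codeword pair
    corr = v.T @ a / T
    return [tuple(p) for p in np.argwhere(corr > threshold)]
```

In the paper's setting such correlated codeword pairs would be grouped per channel combination (e.g. visual foreground with audio background) to form the four AVG types; the sketch above only shows the correlation-grouping step.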