Audio-visual grouplet: temporal audio-visual interactions for general video concept classification

  • Authors:
  • Wei Jiang, Alexander C. Loui

  • Affiliations:
  • Eastman Kodak Company, Rochester, NY, USA

  • Venue:
  • MM '11 Proceedings of the 19th ACM international conference on Multimedia
  • Year:
  • 2011


Abstract

We investigate general concept classification in unconstrained videos through joint audio-visual analysis. A novel representation, the Audio-Visual Grouplet (AVG), is extracted by studying statistical temporal audio-visual interactions. An AVG is defined as a set of audio and visual codewords that are grouped together according to their strong temporal correlations in videos. The AVGs carry unique audio-visual cues to represent the video content, from which an audio-visual dictionary can be constructed for concept classification. By using entire AVGs as building elements, the audio-visual dictionary is much more robust than traditional vocabularies built from discrete audio or visual codewords. Specifically, we conduct coarse-level foreground/background separation in both audio and visual channels, and discover four types of AVGs by exploring mixed-and-matched temporal correlations among the following factors: visual foreground, visual background, audio foreground, and audio background. All four types of AVGs provide discriminative audio-visual patterns for classifying various semantic concepts. We extensively evaluate our method on the large-scale Columbia Consumer Video set. Experiments demonstrate that the AVG-based dictionaries achieve consistent and significant performance improvements compared with other state-of-the-art approaches.
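The core idea of grouping audio and visual codewords by strong temporal correlation can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the use of Pearson correlation on per-frame codeword activation series, and the correlation threshold are all assumptions made for illustration.

```python
import numpy as np

def temporal_grouplets(visual_hist, audio_hist, threshold=0.7):
    """Pair audio and visual codewords whose per-frame activation
    series are strongly temporally correlated (illustrative sketch).

    visual_hist: (T, Nv) array, visual codeword activations per frame
    audio_hist:  (T, Na) array, audio codeword activations per frame
    Returns a list of (visual_index, audio_index) pairs.
    """
    T = visual_hist.shape[0]

    def znorm(x):
        # z-normalize each codeword's time series so the dot product
        # below equals the Pearson correlation coefficient
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    v = znorm(visual_hist.astype(float))
    a = znorm(audio_hist.astype(float))

    # (Nv, Na) matrix of correlations between every codeword pair
    corr = v.T @ a / T
    return [tuple(p) for p in np.argwhere(corr > threshold)]
```

In the paper's setting such correlated codeword pairs would be grouped per channel combination (e.g. visual foreground with audio background) to form the four AVG types; the sketch above only shows the correlation-grouping step.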