Audio-visual atoms for generic video concept classification

Authors:
Wei Jiang;Courtenay Cotton;Shih-Fu Chang;Dan Ellis;Alexander C. Loui
Affiliations:
Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY;Columbia University, New York, NY;Eastman Kodak Company, Rochester, NY
Venue:
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Year:
2010

Abstract

We investigate the challenging problem of joint audio-visual analysis of generic videos for concept detection. We extract a novel local representation, the Audio-Visual Atom (AVA), defined as a region track associated with regional visual features and audio onset features. We develop a hierarchical algorithm to extract visual atoms from generic videos, and we locate energy onsets in the corresponding soundtrack by time-frequency analysis. Audio atoms are extracted around these energy onsets. Visual and audio atoms together form AVAs, from which discriminative audio-visual codebooks are constructed for concept detection. Experiments on Kodak's consumer benchmark videos confirm the effectiveness of our approach.
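
The abstract does not say which time-frequency analysis is used to locate energy onsets. As a rough illustration only, the sketch below uses a common stand-in, spectral flux over a short-time Fourier transform, to mark moments where soundtrack energy rises sharply. The function name `energy_onsets`, its parameters, and the thresholding rule are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def energy_onsets(signal, sample_rate, frame_len=512, hop=256, threshold=1.5):
    """Return approximate onset times (seconds) where spectral energy rises sharply.

    Hypothetical helper: spectral flux is a stand-in for the paper's
    unspecified time-frequency analysis.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop

    # Short-time magnitude spectrum: one windowed frame per row.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))

    # Log-compressed energy per frequency bin, then keep only the
    # increases between consecutive frames (half-wave rectification).
    log_energy = np.log1p(spectrum ** 2)
    flux = np.maximum(np.diff(log_energy, axis=0), 0.0).sum(axis=1)

    # Assumed rule: an onset is a local peak of the flux curve that
    # exceeds mean + threshold * standard deviation.
    cutoff = flux.mean() + threshold * flux.std()
    peaks = [
        i for i in range(1, len(flux) - 1)
        if flux[i] > cutoff and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]
    ]

    # flux[i] compares frame i+1 with frame i, so report the start of frame i+1.
    return [(p + 1) * hop / sample_rate for p in peaks]
```

In a pipeline like the one the abstract outlines, short audio segments around each returned time could then be cut out as candidate audio atoms; how the paper sizes and describes those segments is not stated here.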