Short-term audio-visual atoms for generic video concept classification

  • Authors:
  • Wei Jiang; Courtenay Cotton; Shih-Fu Chang; Dan Ellis; Alexander Loui

  • Affiliations:
  • Columbia University, New York, NY, USA (Jiang, Cotton, Chang, Ellis); Eastman Kodak, Rochester, NY, USA (Loui)

  • Venue:
  • MM '09: Proceedings of the 17th ACM International Conference on Multimedia
  • Year:
  • 2009

Abstract

We investigate the challenging problem of joint audio-visual analysis of generic videos, targeting semantic concept detection. We propose to extract a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning, and codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm over Kodak's consumer benchmark video set from real users. Experimental results confirm significant performance improvements: over 120% gain in mean average precision (MAP) compared to alternative approaches using static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% (in terms of average precision) over 21 concepts, with gains of more than 20% on many individual concepts.
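To make the representation described in the abstract concrete, below is a minimal sketch in Python/NumPy of how a bag of S-AVA-style atoms (one per short-term region track, concatenating regional visual features with window-level audio features) might be encoded against a codebook into a fixed-length video feature. Everything here is an illustrative assumption rather than the authors' method: the placeholder feature extractors, the random "codebook", and the Gaussian-similarity max-pooling stand in for the paper's STR-PTRS tracker, its actual visual and audio descriptors, and its MIL-learned discriminative codebooks.

```python
# Sketch of an S-AVA-style bag-of-atoms encoding (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def regional_visual_features(region_track):
    # Placeholder: pool a per-frame visual descriptor over one
    # short-term region track; real descriptors (color, texture,
    # edges, motion) are not reproduced here.
    return region_track.mean(axis=0)

def background_audio_features(audio_window):
    # Placeholder: one descriptor for the audio of the same
    # short-term window.
    return audio_window.mean(axis=0)

def build_savas(region_tracks, audio_windows):
    # One atom per region track: visual features concatenated with
    # the background audio features of its window.
    return np.stack([
        np.concatenate([regional_visual_features(t),
                        background_audio_features(a)])
        for t, a in zip(region_tracks, audio_windows)
    ])

def codebook_feature(savas, codebook, sigma=1.0):
    # Encode a video (a bag of atoms) against a codebook: Gaussian
    # similarity of each atom to each codeword, max-pooled over the
    # bag -- a common bag-level pooling choice, used here purely as
    # an illustration of codebook-based features.
    d2 = ((savas[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    sim = np.exp(-d2 / (2 * sigma ** 2))
    return sim.max(axis=0)

# Toy usage: random arrays stand in for real tracks and audio.
num_tracks, frames, vis_dim, aud_dim = 5, 10, 16, 8
tracks = [rng.normal(size=(frames, vis_dim)) for _ in range(num_tracks)]
audio = [rng.normal(size=(frames, aud_dim)) for _ in range(num_tracks)]

savas = build_savas(tracks, audio)                   # (5, 24) bag of atoms
codebook = rng.normal(size=(32, vis_dim + aud_dim))  # stand-in for MIL codebook
feature = codebook_feature(savas, codebook)          # 32-dim video feature
print(feature.shape)
```

The resulting fixed-length vector could then feed any standard concept classifier; in the paper, the codewords themselves are learned discriminatively with Multiple Instance Learning rather than drawn at random as above.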