Recognizing visual content in unconstrained videos has become an important problem for many applications. Existing corpora for video analysis lack scale and/or content diversity, which has limited the needed progress in this critical area. In this paper, we describe and release a new database called CCV, containing 9,317 web videos over 20 semantic categories, including events like "baseball" and "parade", scenes like "beach", and objects like "cat". The database was collected with extra care to ensure relevance to consumer interest and originality of video content without post-editing. Such videos typically carry very little textual annotation and thus stand to benefit from automatic content analysis techniques. We used the Amazon Mechanical Turk (MTurk) platform to perform manual annotation, and studied the behavior and performance of human annotators on MTurk. We also compared the abilities of humans and machines in understanding consumer video content. For the latter, we implemented automatic classifiers using a state-of-the-art multi-modal approach that achieved top performance in the recent TRECVID multimedia event detection task. Results confirmed that classifiers fusing audio and video features significantly outperform single-modality solutions. We also found that humans are much better at recognizing categories of non-rigid objects such as "cat", while current automatic techniques come relatively close to humans on categories with distinctive background scenes or audio patterns.
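The multi-modal approach mentioned above combines scores from classifiers trained separately on audio and visual features. As a minimal illustration of score-level (late) fusion, the sketch below averages per-category scores from two hypothetical single-modality classifiers; the category names match the paper's examples, but the score values and fusion weights are purely illustrative, not results from the paper.

```python
# Hedged sketch of late (score-level) fusion across two modalities.
# All numbers and weights below are illustrative assumptions.

def fuse_scores(visual_scores, audio_scores, w_visual=0.6, w_audio=0.4):
    """Weighted average of per-category scores from two modality classifiers."""
    assert len(visual_scores) == len(audio_scores)
    return [w_visual * v + w_audio * a
            for v, a in zip(visual_scores, audio_scores)]

# Hypothetical per-category confidence scores from independently trained
# classifiers (e.g., a visual model and an audio model for the same clip).
visual = {"baseball": 0.80, "beach": 0.55, "cat": 0.30}
audio  = {"baseball": 0.60, "beach": 0.70, "cat": 0.20}

categories = sorted(visual)
fused = dict(zip(categories,
                 fuse_scores([visual[c] for c in categories],
                             [audio[c] for c in categories])))
best = max(fused, key=fused.get)
print(best, round(fused[best], 2))  # -> baseball 0.72
```

In practice each modality's scores would come from a trained classifier (the paper reports SVM-based classifiers), and the fusion weights would be tuned on validation data rather than fixed by hand.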