Towards textually describing complex video contents with audio-visual concept classifiers

  • Authors:
  • Chun Chet Tan; Yu-Gang Jiang; Chong-Wah Ngo

  • Affiliations:
  • City University of Hong Kong, Kowloon, Hong Kong; Fudan University, Shanghai, China; City University of Hong Kong, Kowloon, Hong Kong

  • Venue:
  • MM '11 Proceedings of the 19th ACM international conference on Multimedia
  • Year:
  • 2011

Abstract

Automatically generating compact textual descriptions of complex video content has wide applications. With recent advances in automatic audio-visual content recognition, this paper explores the technical feasibility of the challenging task of precisely recounting video content. Building on cutting-edge recognition techniques, we first classify a variety of visual and audio concepts in videos. Based on the classification results, we then apply simple rule-based methods to generate textual descriptions. The results are evaluated through carefully designed user studies. We find that state-of-the-art visual and audio concept classification, although far from perfect, provides very useful clues about what is happening in the videos. Most users involved in the evaluation confirmed the informativeness of our machine-generated descriptions.
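The rule-based generation step described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the concept names, confidence threshold, and sentence templates are all assumptions. The idea is to keep only confidently detected visual and audio concepts and slot them into fixed templates.

```python
# Hypothetical sketch of rule-based recounting: map per-concept
# classifier scores to a template sentence. Concept names, the 0.5
# threshold, and the templates are illustrative assumptions.

def describe(scores, threshold=0.5):
    """Generate a short description from visual/audio concept scores."""
    # Keep only concepts whose classifier score clears the threshold.
    visual = {c: s for c, s in scores["visual"].items() if s >= threshold}
    audio = {c: s for c, s in scores["audio"].items() if s >= threshold}

    parts = []
    if visual:
        # Order confident concepts by descending score.
        ranked = sorted(visual, key=visual.get, reverse=True)
        parts.append("the video shows " + ", ".join(ranked))
    if audio:
        ranked = sorted(audio, key=audio.get, reverse=True)
        parts.append("with sounds of " + ", ".join(ranked))

    if not parts:
        return "No confident concepts detected."
    return ("; ".join(parts) + ".").capitalize()

scores = {
    "visual": {"person": 0.92, "dog": 0.81, "beach": 0.30},
    "audio": {"music": 0.75, "cheering": 0.10},
}
print(describe(scores))
# → The video shows person, dog; with sounds of music.
```

A real system along these lines would also need rules for concept co-occurrence and grammatical agreement, but the core mechanism, thresholding classifier outputs and filling templates, stays the same.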