Video2Text: Learning to Annotate Video Content

Authors:
Hrishikesh Aradhye;George Toderici;Jay Yagnik
Venue:
ICDMW '09 Proceedings of the 2009 IEEE International Conference on Data Mining Workshops
Year:
2009

Abstract

This paper discusses a new method for automatic discovery and organization of descriptive concepts (labels) within large real-world corpora of user-uploaded multimedia, such as YouTube.com. Conversely, it also provides validation of existing labels, if any. While training, our method does not assume any explicit manual annotation other than the weak labels already available in the form of video title, description, and tags. Prior work related to such auto-annotation assumed that a vocabulary of labels of interest (e.g., indoor, outdoor, city, landscape) is specified a priori. In contrast, the proposed method begins with an empty vocabulary. It analyzes audiovisual features of 25 million YouTube.com videos (nearly 150 years of video data), effectively searching for consistent correlation between these features and text metadata. It autonomously extends the label vocabulary as and when it discovers concepts it can reliably identify, eventually leading to a vocabulary with thousands of labels and growing. We believe that this work significantly extends the state of the art in multimedia data mining, discovery, and organization based on the technical merit of the proposed ideas as well as the enormous scale of the mining exercise in a very challenging, unconstrained, noisy domain.
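
The abstract does not specify the features, classifiers, or acceptance criteria used, so the following is only an illustrative sketch of the general idea of growing a vocabulary from weak labels, not the authors' method. It assumes each video is a small numeric feature vector paired with a set of metadata words, substitutes a simple nearest-centroid classifier for whatever model the paper actually uses, and admits a word into the vocabulary only if its held-out precision clears a threshold. The function name `grow_vocabulary`, the parameters `min_count` and `min_precision`, and the toy data are all hypothetical.

```python
from collections import Counter


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def grow_vocabulary(videos, min_count=4, min_precision=0.9):
    """Start from an empty vocabulary and admit each metadata word whose
    nearest-centroid detector is precise enough on held-out videos.

    videos: list of (feature_vector, set_of_metadata_words) pairs.
    Both thresholds are hypothetical values chosen for this toy example.
    """
    # Deterministic split: even positions train, odd positions are held out.
    train, held_out = videos[0::2], videos[1::2]
    counts = Counter(w for _, words in train for w in words)

    vocabulary = {}  # begins empty, as described in the abstract
    for word, n in counts.items():
        if n < min_count:
            continue  # too rare to learn from
        pos = [f for f, words in train if word in words]
        neg = [f for f, words in train if word not in words]
        if not neg:
            continue  # word appears on every training video; nothing to contrast
        c_pos, c_neg = centroid(pos), centroid(neg)

        # Held-out videos the detector would label with this word.
        predicted = [words for f, words in held_out
                     if sq_dist(f, c_pos) < sq_dist(f, c_neg)]
        if not predicted:
            continue
        precision = sum(word in words for words in predicted) / len(predicted)
        if precision >= min_precision:
            vocabulary[word] = (c_pos, c_neg)  # keep the learned detector
    return vocabulary


# Toy corpus: "guitar" and "soccer" correlate with the features,
# while "funny" is attached to videos regardless of their content.
videos = []
for label, base in (("guitar", [1.0, 0.0]), ("soccer", [0.0, 1.0])):
    for i in range(8):
        words = {label} | ({"funny"} if i % 4 < 2 else set())
        videos.append(([base[0], base[1] + 0.01 * i], words))

vocabulary = grow_vocabulary(videos)
print(sorted(vocabulary))  # ['guitar', 'soccer']
```

In this toy run the uncorrelated word "funny" is rejected because its detector reaches only 50% held-out precision, which mirrors the abstract's point that a label enters the vocabulary only once it can be reliably identified from audiovisual features.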