Video concept detection aims to find videos that show a certain event described by a high-level concept, e.g. "wedding ceremony" or "changing a tire". This paper presents a theoretical framework and experimental evidence suggesting that video concept detection on consumer-produced videos can be performed using what we call "percepts": a set of observable units that follows a Zipfian distribution. We present an unsupervised approach to extracting percepts from audio tracks, which we then use in experiments on the TRECVID MED 2011 dataset to provide evidence for the validity of the proposed framework. The approach selects the most relevant percepts for each concept automatically, thereby filtering and reducing the amount of training data needed. We show that our framework provides a highly usable foundation for video retrieval on consumer-produced content and is applicable to acoustic, visual, and multimodal content analysis.
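As a rough illustration of the two ideas in the abstract, the sketch below assumes percepts have already been quantized into discrete labels (the abstract does not specify how; the label names, the log-odds scoring, and the synthetic Zipf-like stream here are all hypothetical, not the paper's actual method). It checks the rank-frequency shape of a percept stream and ranks percepts by how strongly they distinguish positive from negative videos for a concept.

```python
from collections import Counter
import math
import random

def rank_frequency(percepts):
    """Return (rank, count) pairs sorted by descending frequency."""
    counts = sorted(Counter(percepts).values(), reverse=True)
    return list(enumerate(counts, start=1))

def select_percepts(pos_percepts, neg_percepts, k=5):
    """Score each percept by the smoothed log-odds of appearing in
    positive vs. negative videos and keep the top-k (a stand-in for
    the paper's automatic relevance selection)."""
    pos, neg = Counter(pos_percepts), Counter(neg_percepts)
    vocab = set(pos) | set(neg)
    n_pos, n_neg = len(pos_percepts), len(neg_percepts)
    def score(p):
        return (math.log((pos[p] + 1) / (n_pos + len(vocab)))
                - math.log((neg[p] + 1) / (n_neg + len(vocab))))
    return sorted(vocab, key=score, reverse=True)[:k]

# Toy data: a Zipf-like stream of percept labels.
random.seed(0)
stream = random.choices(range(50),
                        weights=[1 / (r + 1) for r in range(50)],
                        k=10_000)
rf = rank_frequency(stream)
print(rf[:3])  # a few frequent percepts dominate the long tail

top = select_percepts(["clap", "music", "cheer", "music"],
                      ["traffic", "music"], k=2)
print(top)  # "music" is discounted because it also occurs in negatives
```

The Zipfian shape is what makes the selection step pay off: most percepts are rare and can be dropped, so a handful of high-scoring percepts per concept carries most of the signal.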