Given the deluge of multimedia content becoming available over the Internet, it is increasingly important to be able to effectively examine and organize these large stores of information in ways that go beyond browsing or collaborative filtering. In this paper we review previous work on audio and video processing and define the task of Topic-Oriented Multimedia Summarization (TOMS) using natural language generation: given a set of features automatically extracted from a video (such as visual concepts and ASR transcripts), a TOMS system automatically generates a paragraph of natural language ("a recounting") that summarizes the important information in a video belonging to a certain topic area and explains why the video was matched and retrieved. We see this as a first step towards systems that can discriminate between visually similar but semantically different videos, compare two videos and provide textual output, or summarize a large number of videos at once. In this paper, we introduce our approach to solving the TOMS problem. We extract visual concept features and ASR transcription features from a given video, and develop a template-based natural language generation system that produces a textual recounting based on the extracted features. We also propose possible experimental designs for continuously evaluating and improving TOMS systems, and present results of a pilot evaluation of our initial system.
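To make the template-based recounting step concrete, the following is a minimal sketch of how detected visual concepts and ASR keywords could be slotted into fixed sentence templates. The feature names, score threshold, and template wording here are illustrative assumptions, not the actual templates or feature set of the system described in the abstract.

```python
def generate_recounting(topic, concepts, asr_keywords, threshold=0.5):
    """Fill simple sentence templates from video features.

    topic        -- the topic area the video was retrieved for
    concepts     -- list of (concept_name, detector_score) pairs
    asr_keywords -- salient words taken from the ASR transcript
    threshold    -- minimum detector score for a concept to be reported
    """
    sentences = [f"This video was retrieved for the topic '{topic}'."]

    # Keep only confidently detected visual concepts.
    strong = [name for name, score in concepts if score >= threshold]
    if strong:
        sentences.append("Visual analysis detected " + ", ".join(strong) + ".")

    # Report keywords spotted in the speech transcript.
    if asr_keywords:
        sentences.append(
            "The speech transcript mentions " + ", ".join(asr_keywords) + "."
        )
    return " ".join(sentences)


# Hypothetical example input for a single retrieved video.
recounting = generate_recounting(
    "changing a vehicle tire",
    [("car", 0.92), ("tire", 0.81), ("indoor scene", 0.12)],
    ["jack", "lug nuts"],
)
print(recounting)
```

Even this toy version shows the appeal of the template approach for recounting: every output sentence is directly traceable to an extracted feature, which supports the goal of explaining why a video was matched and retrieved.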