Making effective use of multimedia content depends on structural analysis of that content, which requires integrating several media processing techniques, including image, audio, and text analysis. To understand the utterances in a video in relation to the scene, it is essential to recognize which objects appear in the video. In this paper, we focus on Japanese cooking TV videos and propose a method that acquires object models of foods in an unsupervised manner and then performs object recognition based on the acquired models. First, the topic of each video segment is identified using hidden Markov models (HMMs) in order to obtain good examples for object model acquisition. Next, close-up images are extracted from the image sequences, and an attention region is determined on each close-up image. Then, an important word is extracted as a keyword from the utterances around the close-up image and is associated with that image. Object models are acquired by collecting these pairs of close-up image and keyword from a large number of videos. Finally, object recognition is performed using the acquired object models together with linguistic information. We conducted experiments on two kinds of cooking TV programs. We acquired object models for around 100 foods with an accuracy of 77.8%, and the F-measure of object recognition was 0.727.
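The acquisition and recognition steps described above can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not specify the image features or the form of the object models, so the sketch assumes each attention region has already been reduced to a small numeric feature vector, represents each object model as the mean vector of its examples, and recognizes by nearest model. The names `acquire_object_models`, `recognize`, `min_examples`, and `candidate_words` are hypothetical, and the restriction to `candidate_words` is only a stand-in for the use of linguistic information.

```python
from collections import defaultdict
from math import dist


def acquire_object_models(pairs, min_examples=2):
    """Build one model per keyword from (keyword, feature_vector) pairs.

    Assumption: a model is simply the mean feature vector of its examples.
    Keywords with fewer than `min_examples` examples are discarded.
    """
    grouped = defaultdict(list)
    for keyword, features in pairs:
        grouped[keyword].append(features)

    models = {}
    for keyword, vectors in grouped.items():
        if len(vectors) < min_examples:
            continue  # too few examples to form a model
        dims = len(vectors[0])
        models[keyword] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dims)
        ]
    return models


def recognize(features, models, candidate_words=None):
    """Return the keyword of the nearest model, or None if none qualifies.

    Assumption: `candidate_words` (e.g. food names heard in nearby
    utterances) narrows the search to those keywords when given.
    """
    names = [
        k for k in models
        if candidate_words is None or k in candidate_words
    ]
    if not names:
        return None
    return min(names, key=lambda k: dist(features, models[k]))


# Toy data: made-up three-dimensional features for each close-up image.
pairs = [
    ("carrot", [0.9, 0.4, 0.1]),
    ("carrot", [0.8, 0.5, 0.1]),
    ("onion", [0.9, 0.9, 0.8]),
    ("onion", [0.8, 0.8, 0.7]),
    ("salt", [1.0, 1.0, 1.0]),  # only one example, so no model is built
]
models = acquire_object_models(pairs)
print(sorted(models))
print(recognize([0.85, 0.45, 0.15], models))
```

Restricting the candidates changes the result when the spoken context points elsewhere: `recognize([0.85, 0.45, 0.15], models, candidate_words={"onion"})` can only return `"onion"`, since that is the only model left to compare against.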