Automatic object model acquisition and object recognition by integrating linguistic and visual information

Authors:
Tomohide Shibata;Norio Kato;Sadao Kurohashi
Affiliations:
Kyoto University, Kyoto, Japan;University of Tokyo, Tokyo, Japan;Kyoto University, Kyoto, Japan
Venue:
Proceedings of the 15th international conference on Multimedia
Year:
2007

Abstract

To make effective use of multimedia content, the crucial point is structural analysis of the content, which requires integrating several media processing techniques, including image, audio, and text analysis. To understand utterances in videos in the context of the scene, it is essential to recognize which objects appear in the videos. In this paper, we focus on Japanese cooking TV videos and propose a method for acquiring object models of foods in an unsupervised manner and performing object recognition based on the acquired models. First, the topic of each video segment is identified using hidden Markov models (HMMs) to obtain good examples for object model acquisition. Next, close-up images are extracted from the image sequences, and an attention region in each close-up image is determined. Then, an important word is extracted as a keyword from the utterances around each close-up image and associated with that image. By collecting pairs of close-up images and keywords from a large number of videos, object models are acquired. Object recognition is then performed based on the acquired object models and linguistic information. We conducted experiments on two kinds of cooking TV programs. We acquired object models of around 100 foods with an accuracy of 77.8%. The F-measure of object recognition was 0.727.
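
The acquisition and recognition steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the image features, the form of the object models, or the matching rule. The sketch assumes that each attention region has already been reduced to a fixed-length feature vector (for example, a color histogram), that an object model is simply the mean vector for a keyword, and that recognition picks the nearest model among words taken from nearby utterances. The function names `acquire_object_models` and `recognize` and all values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def acquire_object_models(pairs):
    """Build one model per keyword by averaging its feature vectors.

    `pairs` is an iterable of (keyword, feature_vector) tuples, standing in
    for the (keyword, close-up image) pairs collected from many videos.
    Averaging is an assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for keyword, feature in pairs:
        grouped[keyword].append(feature)
    return {
        keyword: [sum(column) / len(vectors) for column in zip(*vectors)]
        for keyword, vectors in grouped.items()
    }


def recognize(feature, models, candidate_words):
    """Return the candidate word whose model is closest to `feature`.

    `candidate_words` stands in for the linguistic information: only words
    mentioned in nearby utterances are considered. Returns None when no
    candidate has an acquired model.
    """
    candidates = [word for word in candidate_words if word in models]
    if not candidates:
        return None

    def distance(word):
        # Euclidean distance between the query and the stored model.
        return sqrt(sum((a - b) ** 2 for a, b in zip(feature, models[word])))

    return min(candidates, key=distance)


if __name__ == "__main__":
    # Hypothetical three-bin feature vectors, made up for illustration.
    training_pairs = [
        ("carrot", [0.9, 0.1, 0.0]),
        ("carrot", [0.7, 0.3, 0.0]),
        ("onion", [0.2, 0.2, 0.6]),
    ]
    models = acquire_object_models(training_pairs)
    print(recognize([0.75, 0.2, 0.05], models, ["carrot", "onion"]))
```

Restricting the search to words from nearby utterances is one simple way to combine linguistic and visual evidence; a system of the kind the abstract describes may weight the two sources differently.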