Automatic object model acquisition and object recognition by integrating linguistic and visual information

  • Authors:
  • Tomohide Shibata; Norio Kato; Sadao Kurohashi

  • Affiliations:
  • Kyoto University, Kyoto, Japan; University of Tokyo, Tokyo, Japan; Kyoto University, Kyoto, Japan

  • Venue:
  • Proceedings of the 15th international conference on Multimedia
  • Year:
  • 2007

Abstract

To make effective use of multimedia content, the crucial step is structural analysis of that content, which requires integrating several media processing techniques, including image, audio, and text analysis. To understand the utterances in a video in relation to the scene, it is essential to recognize which objects appear in it. In this paper, we focus on Japanese cooking TV programs and propose a method that acquires object models of foods in an unsupervised manner and performs object recognition based on the acquired models. First, the topic of each video segment is identified using HMMs in order to obtain good examples for object model acquisition. Next, close-up images are extracted from the image sequences, and an attention region is determined in each close-up image. Then, an important word is extracted as a keyword from the utterances around each close-up image and associated with that image. Object models are acquired by collecting such close-up image/keyword pairs from a large number of videos. Finally, object recognition is performed based on the acquired object models and linguistic information. We conducted experiments on two kinds of cooking TV programs: object models for around 100 foods were acquired with an accuracy of 77.8%, and object recognition achieved an F-measure of 0.727.
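The acquisition and recognition steps can be illustrated with a minimal sketch. The abstract does not specify the image features, the model form, or how linguistic information is weighted, so everything here is an assumption: each attention region is taken to be a plain feature vector (for example, a normalized color histogram), an object model is simply the mean (centroid) of the vectors paired with a keyword, and recognition scores each model by cosine similarity plus a hypothetical bonus `alpha` when the keyword is also spoken near the frame. The names `acquire_models`, `recognize`, `cosine`, and `alpha` are illustrative only and are not the authors' method.

```python
import math
from collections import defaultdict


def acquire_models(pairs):
    """Build one object model per keyword from (keyword, feature_vector) pairs.

    Assumption: a model is the mean (centroid) of all feature vectors
    collected for that keyword across many videos.
    """
    sums = {}
    counts = defaultdict(int)
    for keyword, vec in pairs:
        if keyword not in sums:
            sums[keyword] = [0.0] * len(vec)
        for i, x in enumerate(vec):
            sums[keyword][i] += x
        counts[keyword] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def recognize(region, models, utterance_words, alpha=0.5):
    """Return the keyword whose model best matches the attention region.

    Hypothetical scoring: visual cosine similarity, plus `alpha` if the
    keyword also occurs in the utterances around the frame.
    """
    best, best_score = None, -math.inf
    for keyword, model in models.items():
        score = cosine(region, model)
        if keyword in utterance_words:
            score += alpha
        if score > best_score:
            best, best_score = keyword, score
    return best


if __name__ == "__main__":
    # Toy 3-dimensional "features" standing in for real image descriptors.
    pairs = [
        ("tomato", [0.9, 0.1, 0.0]),
        ("tomato", [0.8, 0.2, 0.0]),
        ("spinach", [0.1, 0.9, 0.0]),
    ]
    models = acquire_models(pairs)
    region = [0.7, 0.3, 0.0]
    print(recognize(region, models, set()))        # tomato (visual evidence only)
    print(recognize(region, models, {"spinach"}))  # spinach (spoken keyword tips the balance)
```

The second call shows why combining linguistic and visual cues can matter: a visually ambiguous region can be resolved by what is said in the surrounding utterances. A real system would use richer features and a learned weighting rather than a fixed `alpha`.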