Modeling tagged photos for automatic image annotation

  • Authors:
  • Neela Sawant

  • Affiliations:
  • The Pennsylvania State University, University Park, PA, USA

  • Venue:
  • MM '11 Proceedings of the 19th ACM international conference on Multimedia
  • Year:
  • 2011

Abstract

A semantic concept can be a physical object (e.g., 'car', 'zebra-fish'), an activity (e.g., 'demonstration', 'running'), or an abstract category (e.g., 'historical', 'autumn'). In automatic image annotation, machines are taught to infer such semantic concepts by training on hundreds of manually selected images that contain them. So far, remarkable progress has been made in visual feature extraction and in the machine learning algorithms that relate features to concepts. The new challenge is to scale inference and annotation capacity to the thousands of semantic concepts in the real world.

The key to large-scale machine annotation lies in automatic training-data selection from user-tagged photos shared on social websites. To learn an annotation model for a concept such as 'car', one can exploit the fact that 'car' is also a common tag on photos of day-to-day activities. The concept may then be considered to be illustrated by the collection of all photos so tagged, and those photos can be retrieved through a text-search interface and readily used to train the annotation model. While this seems like a straightforward idea, a few ramifications have to be considered:

  • Substandard images: Tagging is an uncontrolled activity, guided by personal motivations and communal influences rather than geared towards scientific computation. Tags may be incorrect or incomplete, and some images may be poor examples of the target concept. Such substandard training images may hamper annotation performance.
  • Modeling constraints: Because tags as well as visual features determine the quality of training data, a joint analysis of visual and textual features is necessary. Additionally, a diverse set of features needs to be combined effectively, even when some features are incapable of modeling specific concepts in the large annotation vocabulary. Efficient computing techniques and infrastructure are necessary to maintain scalability and adaptability to new concepts and training data.

We recently surveyed the application of tagged photographs to image annotation [2]. The literature was organized around four main annotation types: (a) general-purpose concepts, (b) names of people, (c) names of locations, and (d) events. An important observation was that a majority of studies resorted to semi-supervised learning to annotate an unlabeled (or partially labeled) image. Specifically, content-based retrieval techniques were used to identify similar images, and a tag-ranking model was developed to transfer the annotations of the retrieved results onto the query image (a sketch of this tag-transfer idea appears below). Such data-driven techniques can potentially access an arbitrarily large annotation vocabulary. However, they do not conform to the idea of model-based vision, and they work only if large labeled image sets can be analyzed at run time.

We adopted a supervised learning approach to large-scale image annotation in which training data was selected from Flickr images [3]. The selection process was designed to reject substandard images (also sketched below). Annotation models were trained and stored for 1,000 words, so that only the pixel information of test images needed to be analyzed at run time. The time required to select training data for a single concept from 10,000 images (634-dimensional features), using a single 2.66 GHz CPU with 24.4 GB of memory, was under 5 minutes. Moreover, compared with a state-of-the-art annotation system, the results showed marked diversity and accuracy. The supervised annotation approach, however, can only predict the concepts used in its training.
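The following is a minimal sketch of the neighbor-based tag transfer common in the surveyed semi-supervised work, assuming precomputed visual feature vectors and tag lists. The function name `transfer_tags`, the Euclidean distance, and the similarity weighting are illustrative assumptions, not the specific ranking models covered in [2].

```python
# Minimal sketch: transfer tags from visually similar images to a query image.
import numpy as np

def transfer_tags(query_feat, db_feats, db_tags, k=50, top_n=5):
    """Annotate a query image with the top-ranked tags of its k visual neighbors.

    query_feat : (d,) feature vector of the unlabeled image
    db_feats   : (n, d) features of the tagged collection
    db_tags    : list of n tag lists, aligned with the rows of db_feats
    """
    # Euclidean distance from the query to every database image.
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]

    # Accumulate similarity-weighted votes for each candidate tag.
    scores = {}
    for i in neighbors:
        weight = 1.0 / (1.0 + dists[i])  # closer neighbors count more
        for tag in db_tags[i]:
            scores[tag] = scores.get(tag, 0.0) + weight

    # Return the highest-scoring tags as the predicted annotations.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note that this style of annotator must keep the entire labeled collection available at run time, which is exactly the property the model-based approach below avoids.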
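In the same spirit, here is a toy illustration of training-data selection that rejects likely substandard images by keeping only the visually coherent core of the photos carrying the target tag. The median-centroid heuristic and the `keep_frac` parameter are assumptions made for illustration; the joint visual-textual criterion actually used in [3] is more involved.

```python
# Minimal sketch: filter tag-search results down to plausible training images.
import numpy as np

def select_training_images(feats, tag_sets, concept, keep_frac=0.6):
    """Pick candidate images for `concept`, rejecting likely substandard ones.

    feats    : (n, d) visual features of photos retrieved by tag search
    tag_sets : list of n tag sets, aligned with the rows of feats
    concept  : target concept string, e.g. 'car'
    """
    # Textual check: the target tag must actually be present.
    candidates = [i for i, tags in enumerate(tag_sets) if concept in tags]

    # Visual check: keep the candidates closest to a robust centroid, on the
    # assumption that correct examples form a dense core while mistagged or
    # poor examples are visual outliers.
    sub = feats[candidates]
    center = np.median(sub, axis=0)
    dists = np.linalg.norm(sub - center, axis=1)
    order = np.argsort(dists)
    n_keep = max(1, int(keep_frac * len(candidates)))
    return [candidates[j] for j in order[:n_keep]]
```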
In this sense, the approach does not cater to the preferred vocabularies of users. To address this issue, we developed a personalized tagging extension [1]. We proposed a transfer learning model that translates the set of machine annotations into a user's vocabulary using a Naïve Bayes formulation. The highlight of the technique was the computation of the translation model from the collective tagging behavior of the user's local social network (sketched below). In conclusion, the proposed research harnesses user-tagged images and social interactions on photo-sharing websites to develop a practical image annotation system.
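Below is a minimal sketch of such a vocabulary translation under a Naïve Bayes assumption, where the count tables would be estimated from photos tagged within the user's local social network. The smoothing, scoring, and helper names (`train_translation`, `translate`) are assumptions for illustration, not the exact formulation of [1].

```python
# Minimal sketch: translate machine annotations into a user's own tag
# vocabulary with a Naive Bayes model learned from the user's social circle.
import math
from collections import defaultdict

def train_translation(network_photos):
    """network_photos: list of (machine_tags, user_tags) pairs from photos
    tagged in the user's local network. Returns prior and co-occurrence counts."""
    prior = defaultdict(int)                       # count of each user tag
    cooc = defaultdict(lambda: defaultdict(int))   # user tag -> machine tag counts
    for machine_tags, user_tags in network_photos:
        for u in user_tags:
            prior[u] += 1
            for m in machine_tags:
                cooc[u][m] += 1
    return prior, cooc

def translate(machine_tags, prior, cooc, vocab_size, top_n=5, alpha=1.0):
    """Rank user-vocabulary tags given machine annotations:
    score(u) = log P(u) + sum_m log P(m | u), with add-alpha smoothing."""
    total = sum(prior.values())
    scores = {}
    for u, n_u in prior.items():
        s = math.log(n_u / total)
        for m in machine_tags:
            s += math.log((cooc[u][m] + alpha) / (n_u + alpha * vocab_size))
        scores[u] = s
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Under this formulation, a user tag u is scored by its prior plus how often it co-occurs with the machine's output tags in the user's circle, so the machine vocabulary (e.g., the 1,000 trained words) is mapped onto whatever tags the user and their contacts actually use.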