Scale-invariant visual language modeling for object categorization

  • Authors:
  • Lei Wu; Yang Hu; Mingjing Li; Nenghai Yu; Xian-Sheng Hua

  • Affiliations:
  • MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China (Lei Wu, Yang Hu, Mingjing Li, Nenghai Yu); Microsoft Research Asia, Beijing, China (Xian-Sheng Hua)

  • Venue:
  • IEEE Transactions on Multimedia - Special issue on integration of context and content
  • Year:
  • 2009

Abstract

In recent years, "bag-of-words" models, which treat an image as a collection of unordered visual words, have been widely applied in the multimedia and computer vision fields. However, because they ignore the spatial structure among visual words, they cannot discriminate between objects with similar word frequencies but different spatial distributions of those words. In this paper, we propose a visual language modeling (VLM) method, which incorporates the spatial context of local appearance features into a statistical language model. To represent the object categories, we exploit models with different orders of statistical dependency. In addition, a multilayer extension makes the VLM more robust to scale variations of objects. The model is effective and applicable to large-scale image categorization. We train scale-invariant visual language models on images grouped by Flickr tags and use these models for object categorization. Experimental results show that they outperform both single-layer visual language models and "bag-of-words" models, and that they achieve performance comparable to 2-D MHMM and SVM-based methods at a much lower computational cost.
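
For concreteness, a minimal sketch of the bigram (second-order) case of such a visual language model is given below. It assumes each image has already been reduced to a 2-D grid of visual-word IDs, for example by quantizing patch descriptors against a learned codebook; the class name BigramVLM and the toy data are hypothetical, and the paper's actual method additionally covers higher-order dependencies and the multilayer scale pyramid.

    from collections import defaultdict
    import math

    class BigramVLM:
        """Toy visual language model: P(word | left neighbor), add-one smoothed."""

        def __init__(self, vocab_size):
            self.vocab_size = vocab_size
            self.counts = defaultdict(lambda: defaultdict(int))  # left -> cur -> count
            self.totals = defaultdict(int)                       # left -> total count

        def train(self, word_grids):
            # word_grids: one 2-D list of visual-word IDs per training image
            for grid in word_grids:
                for row in grid:
                    for left, cur in zip(row, row[1:]):  # horizontal bigrams
                        self.counts[left][cur] += 1
                        self.totals[left] += 1

        def log_likelihood(self, grid):
            # Log-probability of a test image's word grid under this model.
            ll = 0.0
            for row in grid:
                for left, cur in zip(row, row[1:]):
                    # Laplace smoothing: unseen bigrams still get nonzero mass.
                    p = (self.counts[left][cur] + 1) / (self.totals[left] + self.vocab_size)
                    ll += math.log(p)
            return ll

    # Hypothetical toy data: one model per category, classify by maximum likelihood.
    training_grids = {
        "cat": [[[0, 1, 2], [0, 1, 2]]],
        "car": [[[3, 4, 5], [3, 4, 5]]],
    }
    test_grid = [[0, 1, 2]]

    models = {}
    for category, grids in training_grids.items():
        models[category] = BigramVLM(vocab_size=500)
        models[category].train(grids)
    predicted = max(models, key=lambda c: models[c].log_likelihood(test_grid))

This sketch conditions only on the horizontal neighbor; the paper's higher-order models presumably condition on additional neighboring words, and its multilayer extension, per the abstract, applies the model across multiple image scales to gain scale invariance.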