Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search

Authors:
Sung Ju Hwang;Kristen Grauman
Affiliations:
Department of Computer Science, University of Texas at Austin, Austin, USA 78712;Department of Computer Science, University of Texas at Austin, Austin, USA 78712
Venue:
International Journal of Computer Vision
Year:
2012

Abstract

We introduce an approach to image retrieval and auto-tagging that leverages the implicit information about object importance conveyed by the list of keyword tags a person supplies for an image. We propose an unsupervised learning procedure based on Kernel Canonical Correlation Analysis that discovers the relationship between how humans tag images (e.g., the order in which words are mentioned) and the relative importance of objects and their layout in the scene. Using this discovered connection, we show how to boost accuracy for novel queries, such that the search results better preserve the aspects a human may find most worth mentioning. We evaluate our approach on three datasets using either keyword tags or natural language descriptions, and quantify results with both ground truth parameters and direct tests with human subjects. Our results show clear improvements over approaches that either rely on image features alone or use words and image features but ignore the implied importance cues. Overall, our work provides a novel way to incorporate high-level human perception of scenes into visual representations for enhanced image search.
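
To make the Kernel Canonical Correlation Analysis step in the abstract more concrete, the following is a minimal sketch of a standard regularized KCCA between an image view and a tag view, followed by cross-modal ranking in the shared space. It is not the authors' implementation. The function names (`center_kernel`, `kcca_fit`, `cross_modal_rank`), the linear kernels, the regularization value, and the synthetic features are all assumptions made for illustration; the abstract does not specify the actual image and tag features or kernels.

```python
import numpy as np


def center_kernel(K):
    """Center a kernel matrix so the implicit features have zero mean."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def kcca_fit(Kx, Ky, reg=0.1, n_components=2):
    """Regularized KCCA on two centered n x n kernel matrices.

    Returns dual weights (alpha, beta) and the estimated correlations.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    # Solve (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha.
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    evals, evecs = np.linalg.eig(M)
    evals, evecs = evals.real, evecs.real
    top = np.argsort(-evals)[:n_components]
    rho = np.sqrt(np.clip(evals[top], 0.0, None))
    alpha = evecs[:, top]
    beta = np.linalg.solve(Ky + reg * I, Kx @ alpha) / np.maximum(rho, 1e-12)
    return alpha, beta, rho


def cross_modal_rank(query_vec, database_vecs):
    """Rank database items by cosine similarity to a query in the shared space."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    norms = np.linalg.norm(database_vecs, axis=1, keepdims=True) + 1e-12
    return np.argsort(-((database_vecs / norms) @ q))


# Hypothetical toy data: both views share one latent factor z plus noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
img_feats = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])
tag_feats = np.hstack([z + 0.1 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

# Linear kernels are an assumption; any positive semidefinite kernel could be used.
Kx = center_kernel(img_feats @ img_feats.T)
Ky = center_kernel(tag_feats @ tag_feats.T)
alpha, beta, rho = kcca_fit(Kx, Ky)

img_proj = Kx @ alpha  # images in the shared space
tag_proj = Ky @ beta   # tag lists in the shared space
ranking = cross_modal_rank(tag_proj[0], img_proj)  # tag-to-image search
```

In this sketch a tag-side query is projected into the learned space and database images are ranked by cosine similarity there. Projecting a query that was not in the training set would additionally require its kernel values against the training items, centered consistently with the training kernels, which is omitted here.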