Framing image description as a ranking task: data, models and evaluation metrics

Authors:
Micah Hodosh; Peter Young; Julia Hockenmaier
Affiliations:
Department of Computer Science, University of Illinois, Urbana, IL (all authors)
Venue:
Journal of Artificial Intelligence Research
Year:
2013

Abstract

The ability to associate images with natural language sentences that describe what is depicted in them is a hallmark of image understanding, and a prerequisite for applications such as sentence-based image search. In analogy to image search, we propose to frame sentence-based image annotation as the task of ranking a given pool of captions. We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. We introduce a number of systems that perform quite well on this task, even though they are only based on features that can be obtained with minimal supervision. Our results clearly indicate the importance of training on multiple captions per image, and of capturing syntactic (word order-based) and semantic features of these captions. We also perform an in-depth comparison of human and automatic evaluation metrics for this task, and propose strategies for collecting human judgments cheaply and on a very large scale, allowing us to augment our collection with additional relevance judgments of which captions describe which image. Our analysis shows that metrics that consider the ranked list of results for each query image or sentence are significantly more robust than metrics that are based on a single response per query. Moreover, our study suggests that the evaluation of ranking-based image description systems may be fully automated.
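
The contrast the abstract draws between single-response metrics and metrics over the full ranked list can be illustrated with a small sketch. It assumes a system assigns a score to every caption in the pool for each query image and that each query has exactly one gold caption. The function names, the toy scores, and the choice of recall@k and median rank as the ranked-list metrics are illustrative assumptions, not taken from the paper.

```python
from statistics import median


def rank_of_gold(scores, gold_index):
    """Return the 1-based rank of the gold caption for one query.

    `scores` holds one hypothetical model score per candidate caption;
    higher means a better match. Ties are resolved in favor of the gold item.
    """
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)


def recall_at_k(ranks, k):
    """Fraction of queries whose gold caption appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def evaluate(score_matrix, gold_indices, ks=(1, 5, 10)):
    """Compute recall@k and median rank over all queries."""
    ranks = [rank_of_gold(row, g) for row, g in zip(score_matrix, gold_indices)]
    result = {f"R@{k}": recall_at_k(ranks, k) for k in ks}
    result["median_rank"] = median(ranks)
    return result


# Toy data (assumed): three query images, a pool of three captions each.
scores = [
    [0.9, 0.1, 0.3],  # gold caption 0 is ranked 1st
    [0.2, 0.4, 0.8],  # gold caption 1 is ranked 2nd
    [0.5, 0.6, 0.1],  # gold caption 2 is ranked 3rd
]
gold = [0, 1, 2]
metrics = evaluate(scores, gold, ks=(1, 2))
print(metrics)
```

Here R@1 looks only at the single top response per query, while R@2 and the median rank also credit a system for placing the gold caption near the top, which is the kind of information a ranked-list metric retains and a single-response metric discards.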