Collective generation of natural image descriptions

  • Authors:
  • Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi

  • Affiliations:
  • Stony Brook University, Stony Brook, NY (all authors)

  • Venue:
  • ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
  • Year:
  • 2012

Abstract

We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web. More specifically, given a query image, we retrieve existing human-composed phrases used to describe visually similar images, then selectively combine those phrases to generate a novel description for the query image. We cast the generation process as constraint optimization problems, collectively incorporating multiple interconnected aspects of language composition for content planning, surface realization and discourse structure. Evaluation by human annotators indicates that our final system generates more semantically correct and linguistically appealing descriptions than two nontrivial baselines.
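To make the retrieve-then-compose idea concrete, here is a minimal sketch of phrase selection as an integer linear program, in the spirit of the abstract's "constraint optimization" framing. This is not the authors' formulation: the candidate phrases, relevance scores, and one-phrase-per-slot constraints are hypothetical stand-ins for the paper's richer content-planning, surface-realization, and discourse constraints, and the example assumes the `pulp` ILP library.

```python
import pulp

# Hypothetical candidates: phrases retrieved from captions of visually
# similar images, each with a relevance score and a syntactic slot type.
candidates = [
    {"text": "a brown dog",      "type": "NP", "score": 0.90},
    {"text": "a small puppy",    "type": "NP", "score": 0.70},
    {"text": "runs across",      "type": "VP", "score": 0.80},
    {"text": "sits on",          "type": "VP", "score": 0.50},
    {"text": "the grassy field", "type": "PP", "score": 0.85},
    {"text": "a wooden bench",   "type": "PP", "score": 0.40},
]

prob = pulp.LpProblem("phrase_selection", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(candidates))]

# Objective: maximize total relevance of the selected phrases.
prob += pulp.lpSum(x[i] * c["score"] for i, c in enumerate(candidates))

# Constraint: exactly one phrase per slot (NP, VP, PP), a toy stand-in
# for the paper's interconnected composition constraints.
for slot in ("NP", "VP", "PP"):
    prob += pulp.lpSum(x[i] for i, c in enumerate(candidates)
                       if c["type"] == slot) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(" ".join(c["text"] for i, c in enumerate(candidates)
               if x[i].value() == 1))
# -> "a brown dog runs across the grassy field"
```

The actual system solves a substantially richer joint problem, coupling content planning, surface realization, and discourse structure rather than filling three fixed slots.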