Composing simple image descriptions using web-scale n-grams

  • Authors:
  • Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, Yejin Choi

  • Affiliation:
  • Stony Brook University, NY (all authors)

  • Venue:
  • CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning
  • Year:
  • 2011


Abstract

Studying natural language, and especially how people describe the world around them, can help us better understand the visual world. In turn, it can also help us in the quest to generate natural language that describes this world in a human manner. We present a simple yet effective approach to automatically composing image descriptions from computer vision-based inputs using web-scale n-grams. Unlike most previous work, which summarizes or retrieves pre-existing text relevant to an image, our method composes sentences entirely from scratch. Experimental results indicate that it is viable to generate simple textual descriptions that are pertinent to the specific content of an image while permitting creativity in the description, making for more human-like annotations than previous approaches.
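
The sketch below is only a minimal illustration of the general idea of scoring candidate word orderings with n-gram statistics; it is not the authors' actual pipeline. The detection pairs, the fixed "on the" connective, and the toy bigram counts are all assumptions made for the example, standing in for real visual detector outputs and web-scale counts.

```python
from itertools import permutations
from math import log

# Toy bigram counts standing in for web-scale n-gram statistics.
# Real counts would come from a far larger corpus; these are illustrative only.
BIGRAM_COUNTS = {
    ("brown", "dog"): 120000,
    ("dog", "on"): 95000,
    ("on", "the"): 5000000,
    ("the", "green"): 300000,
    ("green", "grass"): 150000,
    ("grass", "on"): 8000,
}
UNSEEN = 1  # crude floor for unseen bigrams (no smoothing)

def bigram_score(words):
    """Sum of log bigram counts; higher means more fluent under the toy model."""
    return sum(log(BIGRAM_COUNTS.get(pair, UNSEEN)) for pair in zip(words, words[1:]))

def compose(detections):
    """Pick the most fluent ordering of simple <attribute noun> phrases.

    `detections` is a hypothetical list of (attribute, noun) pairs standing in
    for visual detector outputs; the connective "on the" is fixed purely for
    illustration.
    """
    phrases = [[attr, noun] for attr, noun in detections]
    best = None
    for order in permutations(phrases):
        words = []
        for i, phrase in enumerate(order):
            if i > 0:
                words += ["on", "the"]
            words += phrase
        score = bigram_score(words)
        if best is None or score > best[0]:
            best = (score, words)
    return " ".join(best[1])

if __name__ == "__main__":
    print(compose([("brown", "dog"), ("green", "grass")]))
    # -> "brown dog on the green grass"
```

Under these assumed counts, the ordering "brown dog on the green grass" scores higher than the reversed ordering because its bigrams are all attested, which is the basic intuition behind using n-gram frequencies to choose among candidate compositions.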