Baby talk: Understanding and generating simple image descriptions

Authors:
G. Kulkarni; V. Premraj; S. Dhar; Siming Li; Yejin Choi; A. C. Berg; T. L. Berg
Affiliations:
Stony Brook Univ., Stony Brook, NY, USA;Stony Brook Univ., Stony Brook, NY, USA;Stony Brook Univ., Stony Brook, NY, USA;Stony Brook Univ., Stony Brook, NY, USA;Stony Brook Univ., Stony Brook, NY, USA;Stony Brook Univ., Stony Brook, NY, USA;Stony Brook Univ., Stony Brook, NY, USA
Venue:
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition
Year:
2011

Citing 0
Cited 20

Composing simple image descriptions using web-scale n-grams
CoNLL '11 Proceedings of the Fifteenth Conference on Computational Natural Language Learning

Can computers learn from humans to see better?: inferring scene semantics from viewers' eye movements
MM '11 Proceedings of the 19th ACM international conference on Multimedia

Understanding web images by object relation network
Proceedings of the 21st international conference on World Wide Web

Midge: generating image descriptions from computer vision detections
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Apples to oranges: evaluating image annotations from natural language processing systems
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Detecting visual text
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Distributional semantics in technicolor
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Collective generation of natural image descriptions
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Midge: generating descriptions of images
INLG '12 Proceedings of the Seventh International Natural Language Generation Conference

Efficient image annotation for automatic sentence generation
Proceedings of the 20th ACM international conference on Multimedia

Distributional semantics with eyes: using image analysis to improve computational representations of word meaning
Proceedings of the 20th ACM international conference on Multimedia

Describing clothing by semantic attributes
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part III

Augmented attribute representations
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part V

Image retrieval with structured object queries using latent ranking SVM
ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part VI

From image annotation to image description
ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part V

Image ranking via attribute boosted hypergraph
PCM'12 Proceedings of the 13th Pacific-Rim conference on Advances in Multimedia Information Processing

Unsupervised language learning for discovered visual concepts
ACCV'12 Proceedings of the 11th Asian conference on Computer Vision - Volume Part IV

Automatic image description by using word-level features
Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service

Framing image description as a ranking task: data, models and evaluation metrics
Journal of Artificial Intelligence Research

A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics
International Journal of Computer Vision

Abstract

We posit that visually descriptive language offers computer vision researchers both information about the world and information about how people describe the world. The potential benefit from this source is made more significant by the enormous amount of language data readily available today. We present a system to automatically generate natural language descriptions from images that exploits both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision. The system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.