Every picture tells a story: generating sentences from images

  • Authors:
  • Ali Farhadi; Mohsen Hejrati; Mohammad Amin Sadeghi; Peter Young; Cyrus Rashtchian; Julia Hockenmaier; David Forsyth

  • Affiliations:
  • Computer Science Department, University of Illinois at Urbana-Champaign (Ali Farhadi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth); Computer Vision Group, School of Mathematics, Institute for Studies in Theoretical Physics and Mathematics (Mohsen Hejrati, Mohammad Amin Sadeghi)

  • Venue:
  • ECCV '10: Proceedings of the 11th European Conference on Computer Vision, Part IV
  • Year:
  • 2010

Abstract

Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
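
To make the linking idea concrete, here is a minimal Python sketch of the scoring structure the abstract describes. It is not the authors' implementation: the paper's shared meaning space consists of <object, action, scene> triplets, but the `image_scorer` and `sentence_scorer` callables and the cosine comparison below are placeholder assumptions.

```python
# Hypothetical sketch: score an (image, sentence) pair by comparing each
# side's estimate of meaning over a shared space of candidate triplets.
# The <object, action, scene> representation follows the paper; the scorer
# callables and the cosine comparison are illustrative assumptions.
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Meaning:
    """One point in the meaning space: an <object, action, scene> triplet."""
    obj: str
    action: str
    scene: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two confidence vectors (an assumed choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def link_score(
    image: object,
    sentence: str,
    meanings: Sequence[Meaning],
    image_scorer: Callable[[object, Meaning], float],
    sentence_scorer: Callable[[str, Meaning], float],
) -> float:
    """Compare the image's and the sentence's estimates of meaning,
    each expressed as confidences over the same pool of triplets."""
    image_estimate = [image_scorer(image, m) for m in meanings]
    sentence_estimate = [sentence_scorer(sentence, m) for m in meanings]
    return cosine(image_estimate, sentence_estimate)
```

Given such a score, both tasks in the abstract become retrieval problems: to annotate an image, keep the candidate sentence with the highest `link_score`; to illustrate a sentence, return the images that score highest against it.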