Corpus-guided sentence generation of natural images

Authors:
Yezhou Yang;Ching Lik Teo;Hal Daumé, III;Yiannis Aloimonos
Affiliations:
University of Maryland Institute for Advanced Computer Studies, College Park, Maryland;University of Maryland Institute for Advanced Computer Studies, College Park, Maryland;University of Maryland Institute for Advanced Computer Studies, College Park, Maryland;University of Maryland Institute for Advanced Computer Studies, College Park, Maryland
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011

Abstract

We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. Because predicting actions directly from still images is unreliable, we use a language model trained on the English Gigaword corpus to estimate the actions, together with the probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces more readable and descriptive sentences than naive strategies that use vision alone.
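
To make the HMM formulation in the abstract concrete, the following is a minimal Python sketch of Viterbi decoding over a chain of sentence slots. It is an illustration under stated assumptions, not the authors' implementation: the slot order (noun, verb, scene), the candidate words, the detector scores and the corpus-derived transition probabilities are all hypothetical values invented for this example, and the function name `viterbi` is our own. Verbs are given a uniform emission score to reflect the abstract's point that actions are not detected directly from the image.

```python
import math

# Hypothetical sentence slots and candidate words (not taken from the paper).
SLOTS = ["noun", "verb", "scene"]
CANDIDATES = {
    "noun": ["dog", "chair"],
    "verb": ["run", "sit"],
    "scene": ["field", "kitchen"],
}

# Hypothetical detector confidences, playing the role of HMM emissions.
# Verbs have no detector in this sketch, so their emission score is uniform.
DETECTION = {
    "noun": {"dog": 0.6, "chair": 0.4},
    "verb": {"run": 0.5, "sit": 0.5},
    "scene": {"field": 0.7, "kitchen": 0.3},
}

# Hypothetical corpus-derived conditionals P(next word | previous word).
TRANSITION = {
    ("dog", "run"): 0.8, ("dog", "sit"): 0.2,
    ("chair", "run"): 0.1, ("chair", "sit"): 0.9,
    ("run", "field"): 0.9, ("run", "kitchen"): 0.1,
    ("sit", "field"): 0.3, ("sit", "kitchen"): 0.7,
}


def viterbi(slots, candidates, detection, transition):
    """Return the most likely word sequence and its log-probability.

    A uniform prior over the first slot is assumed and therefore omitted.
    """
    first = slots[0]
    # best[word] = (log score of the best path ending in word, that path)
    best = {w: (math.log(detection[first][w]), [w]) for w in candidates[first]}
    for slot in slots[1:]:
        new_best = {}
        for w in candidates[slot]:
            # Pick the predecessor that maximizes path score plus transition.
            score, path = max(
                (prev_score + math.log(transition[(p, w)]), prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score + math.log(detection[slot][w]), path + [w])
        best = new_best
    score, path = max(best.values())
    return path, score


path, log_prob = viterbi(SLOTS, CANDIDATES, DETECTION, TRANSITION)
print(path, math.exp(log_prob))
```

With these toy numbers, the corpus statistics decide the verb even though its emission is uninformative: "dog" is the stronger noun detection, "run" is the verb the transition table favors after "dog", and "field" is supported by both the transition table and the scene detector.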