We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. As predicting actions directly from still images is unreliable, we use a language model trained on the English Gigaword corpus to estimate them, together with the probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with sentence components as the hidden nodes and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces more readable and descriptive sentences than naive strategies that use vision alone.
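The decoding step described above can be illustrated with a minimal sketch. It assumes a simple left-to-right chain of three slots (noun, verb, scene) and uses standard Viterbi decoding; it is not the paper's actual model. The function name `viterbi`, the slot layout, the candidate words and all probabilities below are hypothetical values chosen for illustration. Detector confidences play the role of emissions, corpus co-occurrence probabilities play the role of transitions, and verbs receive a neutral emission of 1.0 because no detector observes them.

```python
import math

def viterbi(slots, candidates, emission, transition, floor=1e-6):
    """Return the most likely word sequence and its log-probability.

    slots:      ordered slot names, e.g. ["noun", "verb", "scene"]
    candidates: slot name -> list of candidate words
    emission:   word -> detector confidence; words without a detector
                (here: verbs) default to a neutral 1.0
    transition: (previous word, word) -> co-occurrence probability
    floor:      hypothetical smoothing value for unseen word pairs
    """
    # Start with the first slot; a uniform prior over its words is assumed.
    best = {w: (math.log(emission.get(w, 1.0)), [w])
            for w in candidates[slots[0]]}
    for slot in slots[1:]:
        new_best = {}
        for w in candidates[slot]:
            emit = math.log(emission.get(w, 1.0))
            # Keep only the best-scoring path that ends in word w.
            score, path = max(
                (prev_score + math.log(transition.get((p, w), floor)) + emit,
                 prev_path)
                for p, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score, path + [w])
        best = new_best
    score, path = max(best.values())
    return path, score

# Toy inputs (all values are made up for illustration).
slots = ["noun", "verb", "scene"]
candidates = {
    "noun": ["dog", "chair"],
    "verb": ["sit", "run"],
    "scene": ["park", "kitchen"],
}
# Noisy detector confidences for objects and scenes.
emission = {"dog": 0.7, "chair": 0.3, "park": 0.6, "kitchen": 0.4}
# Co-occurrence probabilities, standing in for corpus statistics.
transition = {
    ("dog", "run"): 0.6, ("dog", "sit"): 0.4,
    ("chair", "sit"): 0.9, ("chair", "run"): 0.1,
    ("run", "park"): 0.8, ("run", "kitchen"): 0.2,
    ("sit", "park"): 0.3, ("sit", "kitchen"): 0.7,
}

path, score = viterbi(slots, candidates, emission, transition)
print(path, math.exp(score))
```

With these toy numbers the decoded sequence is dog, run, park: the verb is never observed in the image, so it is selected purely by how well it links the detected noun to the detected scene under the language statistics.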