Detecting Visual Text

  • Authors:
  • Jesse Dodge; Amit Goyal; Xufeng Han; Alyssa Mensch; Margaret Mitchell; Karl Stratos; Kota Yamaguchi; Yejin Choi; Hal Daumé III; Alexander C. Berg; Tamara L. Berg

  • Affiliations:
  • University of Washington; University of Maryland; Stony Brook University; MIT; Oregon Health & Science University; Columbia University; Stony Brook University; Stony Brook University; University of Maryland; Stony Brook University; Stony Brook University

  • Venue:
  • NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
  • Year:
  • 2012

Abstract

When people describe a scene, they often include information that is not visually apparent; sometimes based on background knowledge, sometimes to tell a story. We aim to separate visual text (descriptions of what is being seen) from non-visual text in natural images and their descriptions. To do so, we first concretely define what it means to be visual, annotate visual text, and then develop algorithms to automatically classify noun phrases as visual or non-visual. We find that using text alone, we are able to achieve high accuracy at this task, and that incorporating features derived from computer vision algorithms improves performance. Finally, we show that we can reliably mine visual nouns and adjectives from large corpora and that we can use these effectively in the classification task.
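
The core task the abstract describes is binary classification of noun phrases as visual or non-visual. As a minimal sketch of that setup, the example below trains a bag-of-words logistic-regression classifier on hypothetical annotated phrases; the toy data, feature choice, and model are illustrative assumptions, not the paper's actual features, annotations, or classifier.

```python
# A minimal sketch of the noun-phrase classification task: 1 = visual
# (observable in the image), 0 = non-visual (background knowledge,
# narrative, etc.). All phrases and labels here are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_phrases = [
    "a red umbrella", "the wooden bench", "two dogs", "green grass",
    "my favorite memory", "a long day", "her birthday", "the best idea",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words (unigram and bigram) features feeding a logistic-regression
# classifier; the paper additionally explores vision-derived features.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_phrases, train_labels)

# Classify unseen noun phrases as visual or non-visual.
for phrase in ["a blue car", "a wonderful trip"]:
    label = "visual" if model.predict([phrase])[0] == 1 else "non-visual"
    print(f"{phrase!r} -> {label}")
```

With realistic training data, the same pipeline extends naturally to the paper's text-only setting, and vision-derived features could be appended to the feature vectors before classification.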