3-D Shape Recovery Using Distributed Aspect Matching
IEEE Transactions on Pattern Analysis and Machine Intelligence - Special issue on interpretation of 3-D scenes—part II
Some advances in transformation-based part of speech tagging
AAAI '94 Proceedings of the twelfth national conference on Artificial intelligence (vol. 1)
Shock Graphs and Shape Matching
International Journal of Computer Vision
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary
ECCV '02 Proceedings of the 7th European Conference on Computer Vision-Part IV
On the Representation and Matching of Qualitative Shape at Multiple Scales
ECCV '02 Proceedings of the 7th European Conference on Computer Vision-Part III
Viewpoint-Invariant Indexing for Content-Based Image Retrieval
CAIVD '98 Proceedings of the 1998 International Workshop on Content-Based Access of Image and Video Databases (CAIVD '98)
Combining Textual and Visual Cues for Content-Based Image Retrieval on the World Wide Web
CBAIVL '98 Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries
Robust analysis of feature spaces: color image segmentation
CVPR '97 Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR '97)
The mathematics of statistical machine translation: parameter estimation
Computational Linguistics - Special issue on using large corpora: II
Generic Model Abstraction from Examples
IEEE Transactions on Pattern Analysis and Machine Intelligence
Learning Visual Compound Models from Parallel Image-Text Datasets
Proceedings of the 30th DAGM symposium on Pattern Recognition
We present ongoing work on learning translation models between image data and text (English) captions. Most approaches to this problem assume a one-to-one or a flat, one-to-many mapping between a segmented image region and a word. However, this assumption is very restrictive from the computer vision standpoint, and fails to account for two important properties of image segmentation: 1) objects often consist of multiple parts, each captured by an individual region; and 2) individual regions are often oversegmented into multiple subregions. Moreover, this assumption also fails to capture the structural relations among words, e.g., part/whole relations. We outline a general framework that accommodates a many-to-many mapping between image regions and words, allowing for structured descriptions on both sides. In this paper, we describe our extensions to the probabilistic translation model of Brown et al. (1993) (as used by Duygulu et al. (2002)) that enable the creation of structured models of image objects. We demonstrate our work in progress, in which a set of annotated images is used to derive a set of labeled, structured descriptions in the presence of oversegmentation.
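To make the baseline concrete: the translation model of Brown et al. (1993) that Duygulu et al. (2002) adapted to region-word alignment is, in its simplest form (IBM Model 1), an EM procedure that estimates p(word | region) from images paired with caption words. The sketch below is illustrative only, not the authors' code; the toy corpus, the use of discrete region labels in place of clustered region feature vectors ("blobs"), and the function name are all assumptions made for the example.

```python
from collections import defaultdict

def train_translation_model(corpus, n_iters=10):
    """IBM Model 1-style EM for region-word translation probabilities.

    corpus: list of (regions, words) pairs, one per annotated image, where
            regions are discrete region labels standing in for blob clusters.
    Returns a dict p keyed by (word, region) with p(word | region).
    """
    # Uniform initialisation over all co-occurring (word, region) pairs.
    vocab = {w for _, words in corpus for w in words}
    p = defaultdict(float)
    for regions, words in corpus:
        for r in regions:
            for w in words:
                p[(w, r)] = 1.0 / len(vocab)

    for _ in range(n_iters):
        count = defaultdict(float)   # expected counts c(word, region)
        total = defaultdict(float)   # expected counts c(region)
        # E-step: fractionally align each caption word to the image's regions.
        for regions, words in corpus:
            for w in words:
                z = sum(p[(w, r)] for r in regions)
                for r in regions:
                    frac = p[(w, r)] / z
                    count[(w, r)] += frac
                    total[r] += frac
        # M-step: renormalise so probabilities sum to one per region.
        for (w, r) in count:
            p[(w, r)] = count[(w, r)] / total[r]
    return p
```

On a toy corpus such as `[(["sky", "grass"], ["sky", "grass"]), (["sky", "water"], ["sky", "water"])]`, EM exploits co-occurrence across images to disambiguate: because "sky" appears with the sky region in both images, p("sky" | sky) rises above p("grass" | sky) within a few iterations. The paper's extension replaces this flat one-region-one-word alignment with structured, many-to-many mappings.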