We address the problem of learning both the semantics (names) and the visual features (SIFT collections) of objects appearing in a training set of unstructured, captioned images of cluttered scenes. Prior work that applies machine translation models to learn associations between image features and caption nouns has assumed a one-to-one correspondence between features and nouns. However, each training image may contain thousands of SIFT features belonging to multiple objects. Our challenge is twofold: 1) grouping the SIFT features into meaningful collections, and 2) learning the object names associated with those collections. Since better collections tend to have stronger associations with object names, we offer an integrated solution that uses the caption words to drive the feature grouping process. The result is a more general model acquisition framework that does not assume that words correspond to individual features and does not require training images with isolated objects or unambiguous labels. The learned model performs well at labeling cluttered scenes in a set of test images.
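To make the idea of an association between a feature collection and a caption word concrete, the following is a minimal sketch and not the method of the paper. It assumes that feature grouping has already produced a set of collection identifiers per training image, and it scores each (collection, word) pair with pointwise mutual information over image-level co-occurrence, a simple stand-in for a translation-style correspondence model. The names `association_scores` and `best_word`, the collection ids "A" and "B", and the toy captions are all hypothetical.

```python
from collections import Counter
from math import log


def association_scores(images):
    """Score (collection, word) pairs by pointwise mutual information.

    `images` is a list of (collection_ids, caption_words) pairs, each a set.
    This is an illustrative stand-in, not the paper's correspondence model.
    """
    n = len(images)
    group_count = Counter()
    word_count = Counter()
    joint_count = Counter()
    for groups, words in images:
        group_count.update(groups)
        word_count.update(words)
        for g in groups:
            for w in words:
                joint_count[(g, w)] += 1

    # PMI(g, w) = log( P(g, w) / (P(g) * P(w)) ), probabilities over images.
    return {
        (g, w): log(c * n / (group_count[g] * word_count[w]))
        for (g, w), c in joint_count.items()
    }


def best_word(scores, group):
    """Return the caption word with the highest score for one collection."""
    candidates = {w: s for (g, w), s in scores.items() if g == group}
    return max(candidates, key=candidates.get)


# Toy training set: hypothetical collection ids paired with caption words.
images = [
    ({"A"}, {"car", "street"}),
    ({"A", "B"}, {"car", "dog"}),
    ({"B"}, {"dog", "street"}),
    ({"B"}, {"dog"}),
]
scores = association_scores(images)
```

In this toy data, collection "A" co-occurs with "car" in every image where it appears, so "car" receives its highest score. The integrated approach described in the abstract goes further: the strength of such associations also feeds back into how features are grouped, which this sketch does not attempt.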