Recognition using visual phrases

Authors:
M. A. Sadeghi;A. Farhadi
Affiliations:
Comput. Sci. Dept., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA;Comput. Sci. Dept., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition
Year:
2011

Citing 0
Cited 8

Efficient image annotation for automatic sentence generation

Proceedings of the 20th ACM international conference on Multimedia
Dog breed classification using part localization

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part I
Unsupervised discovery of mid-level discriminative patches

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part II
Image retrieval with structured object queries using latent ranking SVM

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part VI
Detecting actions, poses, and objects with relational phraselets

ECCV'12 Proceedings of the 12th European conference on Computer Vision - Volume Part IV
Collective activity localization with contextual spatial pyramid

ECCV'12 Proceedings of the 12th international conference on Computer Vision - Volume Part III
Local context priors for object proposal generation

ACCV'12 Proceedings of the 11th Asian conference on Computer Vision - Volume Part I
Learning a context aware dictionary for sparse representation

ACCV'12 Proceedings of the 11th Asian conference on Computer Vision - Volume Part II

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper we introduce visual phrases, complex visual composites like "a person riding a horse". Visual phrases often display significantly reduced visual complexity compared to their component objects, because the appearance of those objects can change profoundly when they participate in relations. We introduce a dataset suitable for phrasal recognition that uses familiar PASCAL object categories, and demonstrate significant experimental gains resulting from exploiting visual phrases. We show that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects. We argue that any multi-class detection system must decode detector outputs to produce final results; this is usually done with non-maximum suppression. We describe a novel decoding procedure that can account accurately for local context without solving difficult inference problems. We show this decoding procedure outperforms the state of the art. Finally, we show that decoding a combination of phrasal and object detectors produces real improvements in detector results.