Watch, Listen & Learn: Co-training on Captioned Images and Videos

  • Authors:
  • Sonal Gupta; Joohyun Kim; Kristen Grauman; Raymond Mooney

  • Affiliations:
  • Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712-0233, U.S.A. (all authors)

  • Venue:
  • ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I
  • Year:
  • 2008

Abstract

Recognizing visual scenes and activities is challenging: visual cues alone are often ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data, our approach learns more accurate image and video classifiers than standard baseline algorithms.
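The abstract's core mechanism is co-training over two views of each example (visual features and accompanying text), where each view's classifier pseudo-labels unlabeled data for the other. Below is a minimal sketch of that generic co-training loop (in the style of Blum & Mitchell); the choice of Gaussian naive Bayes classifiers, the feature matrices, and the pool/round sizes are illustrative assumptions, not the paper's actual features or learners.

    # Sketch of a generic two-view co-training loop (assumed setup, not the paper's exact method).
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X_view1, X_view2, y, labeled_idx, unlabeled_idx,
                 n_rounds=10, per_round=5):
        """Train one classifier per view; each round, every view pseudo-labels
        the unlabeled examples it is most confident about and adds them to the
        shared labeled pool.

        X_view1, X_view2 : feature matrices for the two views
                           (e.g., image features and caption/commentary text features)
        y                : integer label array; only entries at labeled_idx are trusted
        """
        labeled = list(labeled_idx)
        unlabeled = list(unlabeled_idx)
        y = np.array(y)

        clf1, clf2 = GaussianNB(), GaussianNB()
        for _ in range(n_rounds):
            if not unlabeled:
                break
            clf1.fit(X_view1[labeled], y[labeled])
            clf2.fit(X_view2[labeled], y[labeled])

            # Each view labels its most confident unlabeled examples for the other.
            for clf, X in ((clf1, X_view1), (clf2, X_view2)):
                if not unlabeled:
                    break
                probs = clf.predict_proba(X[unlabeled])
                conf = probs.max(axis=1)
                top = np.argsort(conf)[::-1][:per_round]
                # Delete from the unlabeled pool in descending index order
                # so earlier positions stay valid.
                for i in sorted(top, reverse=True):
                    idx = unlabeled[i]
                    y[idx] = clf.classes_[probs[i].argmax()]  # pseudo-label
                    labeled.append(idx)
                    del unlabeled[i]

        # Final fit on the enlarged labeled pool.
        clf1.fit(X_view1[labeled], y[labeled])
        clf2.fit(X_view2[labeled], y[labeled])
        return clf1, clf2

At test time the two view-specific classifiers can be combined (e.g., by averaging their class probabilities), which is one common way such multi-modal co-trained models are used for final prediction.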