Watch, Listen & Learn: Co-training on Captioned Images and Videos

  • Authors:
  • Sonal Gupta; Joohyun Kim; Kristen Grauman; Raymond Mooney

  • Affiliations:
  • Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712-0233, U.S.A. (all authors)

  • Venue:
  • ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I
  • Year:
  • 2008

Abstract

Recognizing visual scenes and activities is challenging: visual cues alone are often ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data, our approach learns more accurate image and video classifiers than standard baseline algorithms.
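The abstract's core mechanism is co-training over two views of each example (visual features and accompanying text), where each view's classifier pseudo-labels unlabeled data for the other. Below is a minimal sketch of that generic co-training loop (in the style of Blum & Mitchell); the choice of Gaussian naive Bayes classifiers, the feature matrices, and the pool/round sizes are illustrative assumptions, not the paper's actual features or learners.

    # Sketch of a generic two-view co-training loop (assumed setup, not the paper's exact method).
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X_view1, X_view2, y, labeled_idx, unlabeled_idx,
                 n_rounds=10, per_round=5):
        """Train one classifier per view; each round, every view pseudo-labels
        the unlabeled examples it is most confident about and adds them to the
        shared labeled pool.

        X_view1, X_view2 : feature matrices for the two views
                           (e.g., image features and caption/commentary text features)
        y                : integer label array; only entries at labeled_idx are trusted
        """
        labeled = list(labeled_idx)
        unlabeled = list(unlabeled_idx)
        y = np.array(y)

        clf1, clf2 = GaussianNB(), GaussianNB()
        for _ in range(n_rounds):
            if not unlabeled:
                break
            clf1.fit(X_view1[labeled], y[labeled])
            clf2.fit(X_view2[labeled], y[labeled])

            # Each view labels its most confident unlabeled examples for the other.
            for clf, X in ((clf1, X_view1), (clf2, X_view2)):
                if not unlabeled:
                    break
                probs = clf.predict_proba(X[unlabeled])
                conf = probs.max(axis=1)
                top = np.argsort(conf)[::-1][:per_round]
                # Delete from the unlabeled pool in descending index order
                # so earlier positions stay valid.
                for i in sorted(top, reverse=True):
                    idx = unlabeled[i]
                    y[idx] = clf.classes_[probs[i].argmax()]  # pseudo-label
                    labeled.append(idx)
                    del unlabeled[i]

        # Final fit on the enlarged labeled pool.
        clf1.fit(X_view1[labeled], y[labeled])
        clf2.fit(X_view2[labeled], y[labeled])
        return clf1, clf2

At test time the two view-specific classifiers can be combined (e.g., by averaging their class probabilities), which is one common way such multi-modal co-trained models are used for final prediction.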