We exploit textual information for recognizing human-human interaction activities in YouTube videos. YouTube videos are generally accompanied by several types of textual information, such as the title, description, and tags. In particular, since some tags describe the visual content of a video, making good use of them can aid activity recognition. The proposed method uses twofold information for activity recognition: (i) visual information: correlations among activities, human poses, configurations of human body parts, and image features extracted from the visual content; and (ii) textual information: correlations with activities extracted from tags. For tag analysis, we discover a set of relevant tags and extract the meaningful words. Correlations between words and activities are learned from expanded tags obtained from the tags of related videos. We develop a model that jointly captures both kinds of information for activity recognition. We cast the model as a structured learning task with latent variables and estimate its parameters with a non-convex minimization procedure. The proposed approach is evaluated on a dataset of highly challenging real-world videos and their assigned tags collected from YouTube. Experimental results demonstrate that, by exploiting visual and textual information in a structured framework, the proposed method significantly improves activity recognition results.
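The idea of learning word-activity correlations from tags can be illustrated with a minimal sketch. This is only an assumed counting scheme (conditional frequency P(activity | word) over per-video tag sets), not the paper's exact formulation, and the function name and sample data are hypothetical:

```python
from collections import Counter

def word_activity_correlations(tagged_videos):
    """Estimate P(activity | word) from (tag_list, activity_label) pairs.

    Illustrative assumption: each video contributes its set of tag words
    once, and the correlation of a word with an activity is the fraction
    of videos containing that word which carry that activity label.
    """
    word_counts = Counter()    # videos containing each word
    joint_counts = Counter()   # videos containing word AND labeled activity
    for tags, activity in tagged_videos:
        for word in set(tags):  # count each word once per video
            word_counts[word] += 1
            joint_counts[(word, activity)] += 1
    return {
        (word, activity): count / word_counts[word]
        for (word, activity), count in joint_counts.items()
    }

# Hypothetical expanded-tag data for two interaction activities.
videos = [
    (["hug", "friends", "park"], "hugging"),
    (["hug", "reunion"], "hugging"),
    (["kiss", "wedding"], "kissing"),
    (["hug", "kiss", "greeting"], "kissing"),
]
scores = word_activity_correlations(videos)
# "hug" occurs in 3 videos, 2 of them labeled "hugging"
```

In the paper these correlations are learned from expanded tags of related videos and then combined with the visual cues inside the joint latent-variable model; the sketch above shows only the tag-side statistic in isolation.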