We exploit textual information for recognizing human-human interaction activities in YouTube videos. YouTube videos are generally accompanied by several types of textual information, such as the title, description, and tags. In particular, since some tags describe the visual content of a video, making good use of them can aid activity recognition. The proposed method uses twofold information for activity recognition: (i) visual information: correlations among activities, human poses, configurations of human body parts, and image features extracted from the visual content; and (ii) textual information: correlations with activities extracted from tags. For tag analysis, we discover a set of relevant tags and extract the meaningful words. Correlations between words and activities are learned from expanded tags obtained from the tags of related videos. We develop a model that jointly captures both kinds of information for activity recognition. We cast the model as a structured learning task with latent variables and estimate its parameters with a non-convex minimization procedure. The proposed approach is evaluated on a dataset of highly challenging real-world videos and their assigned tags collected from YouTube. Experimental results demonstrate that, by exploiting visual and textual information in a structured framework, the proposed method significantly improves activity recognition results.
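The idea of learning word-activity correlations from tags can be illustrated with a minimal sketch. This is only an assumed counting scheme (conditional frequency P(activity | word) over per-video tag sets), not the paper's exact formulation, and the function name and sample data are hypothetical:

```python
from collections import Counter

def word_activity_correlations(tagged_videos):
    """Estimate P(activity | word) from (tag_list, activity_label) pairs.

    Illustrative assumption: each video contributes its set of tag words
    once, and the correlation of a word with an activity is the fraction
    of videos containing that word which carry that activity label.
    """
    word_counts = Counter()    # videos containing each word
    joint_counts = Counter()   # videos containing word AND labeled activity
    for tags, activity in tagged_videos:
        for word in set(tags):  # count each word once per video
            word_counts[word] += 1
            joint_counts[(word, activity)] += 1
    return {
        (word, activity): count / word_counts[word]
        for (word, activity), count in joint_counts.items()
    }

# Hypothetical expanded-tag data for two interaction activities.
videos = [
    (["hug", "friends", "park"], "hugging"),
    (["hug", "reunion"], "hugging"),
    (["kiss", "wedding"], "kissing"),
    (["hug", "kiss", "greeting"], "kissing"),
]
scores = word_activity_correlations(videos)
# "hug" occurs in 3 videos, 2 of them labeled "hugging"
```

In the paper these correlations are learned from expanded tags of related videos and then combined with the visual cues inside the joint latent-variable model; the sketch above shows only the tag-side statistic in isolation.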