Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions

  • Authors:
  • Atsuhiro Kojima; Takeshi Tamura; Kunio Fukunaga

  • Affiliations:
  • Library and Science Information Center, Osaka Prefecture University, 1-1 Gakuen-cho, Sakai, Osaka 599-8531, Japan. ark@center.osakafu-u.ac.jp
  • Library and Science Information Center, Osaka Prefecture University, 1-1 Gakuen-cho, Sakai, Osaka 599-8531, Japan
  • Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, Sakai, Osaka 599-8531, Japan

  • Venue:
  • International Journal of Computer Vision
  • Year:
  • 2002

Abstract

We propose a method for describing human activities from video images based on concept hierarchies of actions. The major difficulty in transforming video images into textual descriptions is bridging the semantic gap between them, also known as the inverse Hollywood problem. In general, the concepts of human events or actions can be classified by semantic primitives. By associating these concepts with semantic features extracted from video images, appropriate syntactic components such as verbs and objects are determined and then translated into natural language sentences. We also demonstrate the performance of the proposed method through several experiments.
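To illustrate the general idea of matching extracted semantic features against a concept hierarchy of actions and then filling a simple case frame, the sketch below shows one possible toy implementation. It is not the authors' system: the hierarchy, the feature keys (speed, approaching, releases_object, agent), and all class and function names are hypothetical, and the sentence generation step is reduced to a single template.

```python
# Minimal sketch (assumed, not the paper's implementation): pick the most
# specific action concept whose condition matches the observed features,
# then fill a simple agent-verb-goal case frame to produce a sentence.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ActionConcept:
    """A node in a (toy) concept hierarchy of actions."""
    verb: str                               # surface verb for this concept
    condition: Callable[[Dict], bool]       # test on extracted semantic features
    children: List["ActionConcept"] = field(default_factory=list)


def most_specific_concept(node: ActionConcept, features: Dict) -> Optional[ActionConcept]:
    """Descend the hierarchy and return the deepest concept whose condition holds."""
    if not node.condition(features):
        return None
    for child in node.children:
        match = most_specific_concept(child, features)
        if match is not None:
            return match
    return node


# Toy hierarchy: the generic concept "moves" specializes into two actions.
hierarchy = ActionConcept(
    verb="moves",
    condition=lambda f: f.get("speed", 0) > 0,
    children=[
        ActionConcept(verb="walks toward",
                      condition=lambda f: f.get("approaching") is not None),
        ActionConcept(verb="puts down",
                      condition=lambda f: f.get("releases_object") is not None),
    ],
)

# Hypothetical features extracted from a video sequence.
features = {"speed": 0.8, "approaching": "the table", "agent": "the person"}

concept = most_specific_concept(hierarchy, features)
if concept is not None:
    goal = features.get("approaching") or features.get("releases_object") or ""
    print(f"{features['agent']} {concept.verb} {goal}.".strip())
    # -> "the person walks toward the table."
```

In this sketch the hierarchy encodes specialization (a more specific verb is chosen whenever its condition on the features is satisfied), which mirrors the abstract's point that semantic primitives classify action concepts and that the matched concept determines the verb and objects of the generated sentence.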