Recognizing a manipulation performed by a human, and then transferring and executing it with a robot, is a difficult problem. We address it in the current study by introducing a novel representation of the relations between objects at decisive time points during a manipulation. This encodes the essential changes in a visual scene in a condensed way, such that a robot can recognize and learn a manipulation without prior object knowledge. To achieve this, we continuously track image segments in the video and construct a dynamic graph sequence. Topological transitions of these graphs occur whenever a spatial relation between some segments changes discontinuously, and these moments are stored in a transition matrix called the semantic event chain (SEC). We demonstrate that these time points are highly descriptive for distinguishing between different manipulations. Using simple sub-string search algorithms, SECs can be compared, and type-similar manipulations can be recognized with high confidence. Because the approach is generic, statistical learning can be used to find the archetypal SEC of a given manipulation class. The performance of the algorithm is demonstrated on a set of real videos showing hands manipulating various objects and performing different actions. In experiments with a robotic arm, we show that an SEC can be learned by observing human manipulations, transferred to a new scenario, and then reproduced by the machine.
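The core idea — compressing a video into its discontinuous relation changes and comparing the resulting chains with sub-string matching — can be illustrated with a minimal sketch. The encoding below is a toy assumption, not the paper's exact relation alphabet: each time step maps a fixed set of segment pairs to a discrete spatial relation (e.g. 0 = apart, 1 = touching), and two event chains are scored by their longest common sub-string, normalised by the longer chain.

```python
def to_event_chain(states):
    """Keep only the time points where some spatial relation changes
    discontinuously -- the decisive transitions forming the event chain."""
    chain = [states[0]]
    for s in states[1:]:
        if s != chain[-1]:
            chain.append(s)
    return chain

def similarity(chain_a, chain_b):
    """Score two event chains with a simple longest-common-substring
    search, normalised to [0, 1] by the longer chain's length."""
    best = 0
    for i in range(len(chain_a)):
        for j in range(len(chain_b)):
            k = 0
            while (i + k < len(chain_a) and j + k < len(chain_b)
                   and chain_a[i + k] == chain_b[j + k]):
                k += 1
            best = max(best, k)
    return best / max(len(chain_a), len(chain_b))

# Two toy observations of the "same" manipulation with different timing:
# each tuple holds the relations (hand-object, object-table) at one frame.
obs1 = [(0, 1), (0, 1), (1, 1), (1, 0), (0, 0)]
obs2 = [(0, 1), (1, 1), (1, 1), (1, 0), (0, 0)]
print(similarity(to_event_chain(obs1), to_event_chain(obs2)))  # → 1.0
```

Because both observations pass through the same sequence of relation transitions, their event chains coincide and the score is 1.0 despite the differing frame-level timing — the property that makes such chains robust descriptors of manipulation type.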