Multi-modal gesture recognition challenge 2013: dataset and results

Authors:
Sergio Escalera;Jordi Gonzàlez;Xavier Baró;Miguel Reyes;Oscar Lopes;Isabelle Guyon;Vassilis Athitsos;Hugo Escalante
Affiliations:
University of Barcelona & Computer Vision Center, Barcelona, Spain;Universitat Autonoma de Barcelona & Computer Vision Center, Barcelona, Spain;Universitat Oberta de Catalunya & Computer Vision Center, Barcelona, Spain;Universitat de Barcelona & Computer Vision Center, Barcelona, Spain;Computer Vision Center, Barcelona, Spain;ChaLearn & Berkeley, Berkeley, USA;University of Texas, Texas, USA;INAOE, Puebla, Mexico
Venue:
Proceedings of the 15th ACM on International conference on multimodal interaction
Year:
2013

Abstract

The recognition of continuous natural gestures is a complex and challenging problem because of the multi-modal nature of the visual cues involved (e.g. finger and lip movements, subtle facial expressions, body pose, etc.), as well as technical limitations such as spatial and temporal resolution and unreliable depth cues. To promote research advances in this field, we organized a challenge on multi-modal gesture recognition. We made available a large video database of 13,858 gestures from a lexicon of 20 Italian gesture categories recorded with a Kinect™ camera, providing the audio, skeletal model, user mask, RGB and depth images. The focus of the challenge was on user-independent multiple gesture learning. There are no resting positions, and the gestures are performed in continuous sequences lasting 1-2 minutes, each containing between 8 and 20 gesture instances. As a result, the dataset contains around 1,720,800 frames. In addition to the 20 main gesture categories, "distracter" gestures are included, meaning that the sequences also contain audio and gestures outside the vocabulary. The final evaluation of the challenge was defined in terms of the Levenshtein edit distance, with the goal of recovering the true order of gestures within each sequence. A total of 54 international teams participated in the challenge, and outstanding results were obtained by the top-ranked participants.
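
To illustrate the evaluation criterion mentioned in the abstract, the following is a minimal sketch of the Levenshtein edit distance between a ground-truth sequence of gesture labels and a predicted one. The function name, the integer label encoding, and the example sequences are assumptions made for illustration only; the abstract does not specify how the challenge normalized or aggregated the distance across sequences, so the sketch returns the raw edit count.

```python
from typing import Sequence


def levenshtein(truth: Sequence[int], pred: Sequence[int]) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn the predicted label sequence into the true one."""
    # prev[j] is the distance between the labels of truth processed
    # so far (before the current one) and the first j labels of pred.
    prev = list(range(len(pred) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]  # distance from i true labels to an empty prediction
        for j, p in enumerate(pred, start=1):
            cost = 0 if t == p else 1
            curr.append(min(
                prev[j] + 1,         # a true label is missing from pred
                curr[j - 1] + 1,     # pred contains an extra label
                prev[j - 1] + cost,  # labels match or are substituted
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: gesture categories encoded as integers 1..20.
truth = [3, 7, 7, 12]
pred = [3, 7, 12, 5]
print(levenshtein(truth, pred))  # 2
```

Because the distance depends only on the order of the recognized labels, a prediction is not penalized for imprecise start and end frames, only for missed, spurious, or misclassified gestures.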