This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI ChaLearn Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach to both detecting and recognizing gestures. Automated gesture detection is performed using both audio signals and hand-joint information obtained from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel-frequency cepstral coefficients (MFCC) are extracted from the audio signals, and a Hidden Markov Model (HMM) is used for classification. Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data. For both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities and predicts the sequence of gestures in each test sample. The proposed system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While it recognizes the known gestures with high accuracy, most of the errors are due to insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
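The abstract does not give implementation details for the detection step, so the following is only a minimal sketch of the audio side of gesture detection: an energy-based segmenter that treats runs of high-energy audio frames as gesture candidates. The use of librosa, the threshold values, and the function names are all assumptions; the paper additionally exploits Kinect hand-joint positions, which are omitted here.

```python
import numpy as np
import librosa

def detect_segments(wav_path, frame_len=2048, hop=512, thresh_db=-35, min_frames=10):
    """Crude energy-based segmentation: a gesture candidate is a run of
    frames whose RMS level (in dB) stays above a threshold.
    All thresholds are illustrative, not taken from the paper."""
    y, sr = librosa.load(wav_path, sr=None)
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    active = db > thresh_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                         # segment opens
        elif not a and start is not None:
            if i - start >= min_frames:       # drop very short blips
                segments.append((start * hop / sr, i * hop / sr))  # seconds
            start = None
    if start is not None and len(active) - start >= min_frames:
        segments.append((start * hop / sr, len(active) * hop / sr))
    return segments
```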
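The audio classification pipeline (MFCC features scored against one HMM per gesture class) could look roughly like the sketch below. librosa and hmmlearn are assumed library choices, and the state count and covariance type are illustrative, not the paper's configuration.

```python
import numpy as np
import librosa
from hmmlearn import hmm

def extract_mfcc(wav_path, n_mfcc=13):
    """Return an MFCC sequence of shape (n_frames, n_mfcc) for one segment."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_gesture_hmms(segments_per_class, n_states=5):
    """Fit one Gaussian HMM per gesture class on that class's MFCC segments."""
    models = {}
    for label, segments in segments_per_class.items():
        X = np.vstack(segments)               # concatenated frames
        lengths = [len(s) for s in segments]  # per-segment frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_segment(models, mfcc_seq):
    """Assign the segment to the class whose HMM gives the highest log-likelihood."""
    scores = {label: m.score(mfcc_seq) for label, m in models.items()}
    return max(scores, key=scores.get), scores
```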
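For the skeletal modality, a covariance descriptor summarizes a gesture segment by the covariance of per-frame joint coordinates; since the matrix is symmetric, its upper triangle gives a fixed-length vector that an SVM can consume. A minimal sketch, with the array layout and the SVC configuration assumed rather than taken from the paper:

```python
import numpy as np
from sklearn.svm import SVC

def covariance_descriptor(joints):
    """joints: array of shape (n_frames, d), where d stacks the x, y, z
    coordinates of the tracked skeleton joints in each frame. Returns the
    upper triangle of the frame-wise covariance matrix as a vector of
    length d * (d + 1) / 2 (the matrix is symmetric)."""
    cov = np.cov(joints, rowvar=False)
    return cov[np.triu_indices_from(cov)]

# Hypothetical usage: one descriptor per segmented gesture, fed to an SVM.
# X_train = np.stack([covariance_descriptor(s) for s in train_segments])
# clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
# probs = clf.predict_proba(np.stack([covariance_descriptor(s) for s in test_segments]))
```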
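Finally, the fusion step and the evaluation metric can be illustrated together. The weighted score fusion below is a generic late-fusion rule, not necessarily the paper's exact scheme; the Levenshtein distance, however, is the standard edit-distance metric the challenge uses, averaged over test samples after normalizing by the length of the ground-truth sequence.

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted sum of per-class score vectors, one vector per modality.
    Each vector is min-max normalized first so modalities are comparable.
    (Illustrative late fusion; the paper's exact rule may differ.)"""
    fused = np.zeros(len(score_list[0]), dtype=float)
    for s, w in zip(score_list, weights):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        fused += w * ((s - s.min()) / rng if rng > 0 else np.zeros_like(s))
    return fused

def levenshtein(pred, truth):
    """Edit distance between predicted and ground-truth gesture label sequences."""
    d = np.zeros((len(pred) + 1, len(truth) + 1), dtype=int)
    d[:, 0] = np.arange(len(pred) + 1)
    d[0, :] = np.arange(len(truth) + 1)
    for i in range(1, len(pred) + 1):
        for j in range(1, len(truth) + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(pred), len(truth)]

# Challenge score: mean of levenshtein(pred, truth) / len(truth) over all samples.
```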