This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI ChaLearn Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach to both detecting and recognizing gestures. Automated gesture detection is performed using both audio signals and hand-joint information obtained from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel-frequency cepstral coefficients (MFCC) are extracted from the audio signals, and a Hidden Markov Model (HMM) is used for classification. Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data. For both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities and predicts the sequence of gestures in each test sample. The proposed system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While it recognizes the known gestures with high accuracy, most of the errors are due to insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.
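The abstract does not give implementation details for the detection step, so the following is only a minimal sketch of the audio side of gesture detection: an energy-based segmenter that treats runs of high-energy audio frames as gesture candidates. The use of librosa, the threshold values, and the function names are all assumptions; the paper additionally exploits Kinect hand-joint positions, which are omitted here.

```python
import numpy as np
import librosa

def detect_segments(wav_path, frame_len=2048, hop=512, thresh_db=-35, min_frames=10):
    """Crude energy-based segmentation: a gesture candidate is a run of
    frames whose RMS level (in dB) stays above a threshold.
    All thresholds are illustrative, not taken from the paper."""
    y, sr = librosa.load(wav_path, sr=None)
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    active = db > thresh_db
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                         # segment opens
        elif not a and start is not None:
            if i - start >= min_frames:       # drop very short blips
                segments.append((start * hop / sr, i * hop / sr))  # seconds
            start = None
    if start is not None and len(active) - start >= min_frames:
        segments.append((start * hop / sr, len(active) * hop / sr))
    return segments
```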
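The audio classification pipeline (MFCC features scored against one HMM per gesture class) could look roughly like the sketch below. librosa and hmmlearn are assumed library choices, and the state count and covariance type are illustrative, not the paper's configuration.

```python
import numpy as np
import librosa
from hmmlearn import hmm

def extract_mfcc(wav_path, n_mfcc=13):
    """Return an MFCC sequence of shape (n_frames, n_mfcc) for one segment."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_gesture_hmms(segments_per_class, n_states=5):
    """Fit one Gaussian HMM per gesture class on that class's MFCC segments."""
    models = {}
    for label, segments in segments_per_class.items():
        X = np.vstack(segments)               # concatenated frames
        lengths = [len(s) for s in segments]  # per-segment frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_segment(models, mfcc_seq):
    """Assign the segment to the class whose HMM gives the highest log-likelihood."""
    scores = {label: m.score(mfcc_seq) for label, m in models.items()}
    return max(scores, key=scores.get), scores
```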
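For the skeletal modality, a covariance descriptor summarizes a gesture segment by the covariance of per-frame joint coordinates; since the matrix is symmetric, its upper triangle gives a fixed-length vector that an SVM can consume. A minimal sketch, with the array layout and the SVC configuration assumed rather than taken from the paper:

```python
import numpy as np
from sklearn.svm import SVC

def covariance_descriptor(joints):
    """joints: array of shape (n_frames, d), where d stacks the x, y, z
    coordinates of the tracked skeleton joints in each frame. Returns the
    upper triangle of the frame-wise covariance matrix as a vector of
    length d * (d + 1) / 2 (the matrix is symmetric)."""
    cov = np.cov(joints, rowvar=False)
    return cov[np.triu_indices_from(cov)]

# Hypothetical usage: one descriptor per segmented gesture, fed to an SVM.
# X_train = np.stack([covariance_descriptor(s) for s in train_segments])
# clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
# probs = clf.predict_proba(np.stack([covariance_descriptor(s) for s in test_segments]))
```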
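Finally, the fusion step and the evaluation metric can be illustrated together. The weighted score fusion below is a generic late-fusion rule, not necessarily the paper's exact scheme; the Levenshtein distance, however, is the standard edit-distance metric the challenge uses, averaged over test samples after normalizing by the length of the ground-truth sequence.

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted sum of per-class score vectors, one vector per modality.
    Each vector is min-max normalized first so modalities are comparable.
    (Illustrative late fusion; the paper's exact rule may differ.)"""
    fused = np.zeros(len(score_list[0]), dtype=float)
    for s, w in zip(score_list, weights):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        fused += w * ((s - s.min()) / rng if rng > 0 else np.zeros_like(s))
    return fused

def levenshtein(pred, truth):
    """Edit distance between predicted and ground-truth gesture label sequences."""
    d = np.zeros((len(pred) + 1, len(truth) + 1), dtype=int)
    d[:, 0] = np.arange(len(pred) + 1)
    d[0, :] = np.arange(len(truth) + 1)
    for i in range(1, len(pred) + 1):
        for j in range(1, len(truth) + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(pred), len(truth)]

# Challenge score: mean of levenshtein(pred, truth) / len(truth) over all samples.
```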