Studies in linguistics and psychology have long observed correlations between gestures and the content of accompanying speech. We explore one aspect of this phenomenon within the framework of automatic classification of upper-body gestures. We demonstrate a correlation between the variance of natural arm motions and the presence of conjunctions used to contrast connected clauses ("but", "neither", etc.). We analyze educational lectures automatically, first modeling the speaker's head, torso, and arms, and then extracting statistical features of their image flows. An AdaBoost-based binary classifier with decision trees as weak learners labels each video clip according to whether its speech content contains such conjunctions. Our database of 3.83 hours of video is segmented into 4243 clips, each with subtitles; the speakers are of different ethnicities and genders and discuss a variety of subjects. We show that training on the set of all conjunctions produces a classifier that performs no better than chance, but that training on the subset of contrastive conjunctions achieves 55% accuracy on a balanced test set. We speculate that such gestures are used to emphasize underlying semantic complexity, and that such classifiers can help presentation video browsers locate semantically significant video segments.
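To make the described pipeline concrete, the sketch below shows one plausible realization: per-clip statistics of dense optical-flow magnitude as features, fed to an AdaBoost classifier with shallow decision trees as weak learners. This is not the authors' released code; it assumes OpenCV's Farneback flow and scikit-learn (version 1.2 or later), and the file names, feature set, and label construction are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): flow-variance features per clip,
# then AdaBoost with decision-tree weak learners, as the abstract describes.
import cv2
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def flow_variance_features(frames):
    """Summarize a clip by statistics (mean, variance) of dense
    optical-flow magnitudes computed between consecutive frames."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).ravel())
        prev = gray
    mags = np.concatenate(mags)
    return np.array([mags.mean(), mags.var()])

# Hypothetical precomputed data: one feature row per subtitle-aligned clip,
# labeled 1 if its subtitle text contains a contrastive conjunction
# ("but", "neither", ...), else 0.
X = np.load("clip_features.npy")            # shape: (n_clips, n_features)
y = np.load("clip_contrastive_labels.npy")  # shape: (n_clips,), values 0/1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision-tree weak learners
    n_estimators=200,
)
clf.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

In this sketch, stratified splitting keeps the test set balanced across the two classes, mirroring the balanced evaluation reported in the abstract; depth-1 trees (decision stumps) are a common choice of weak learner for AdaBoost, though the paper's exact tree depth is not stated here.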