Studies in linguistics and psychology have long observed correlations between gestures and the content of accompanying speech. We explore one aspect of this phenomenon within the framework of automatic classification of upper-body gestures. We demonstrate a correlation between the variance of natural arm motions and the presence of conjunctions used to contrast connected clauses ("but", "neither", etc.). We analyze educational lectures automatically, first modeling the speaker's head, torso, and arms, and then extracting statistical features of their image flows. An AdaBoost-based binary classifier with decision trees as weak learners labels each video clip according to whether its speech content contains such conjunctions. Our database of 3.83 hours of video is segmented into 4243 clips, each with subtitles; the speakers are of different ethnicities and genders and discuss a variety of subjects. We show that training on the set of all conjunctions produces a classifier that performs no better than chance, but that training on the subset of contrastive conjunctions achieves 55% accuracy on a balanced test set. We speculate that such gestures are used to emphasize underlying semantic complexity, and that such classifiers can help presentation video browsers locate semantically significant video segments.
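To make the described pipeline concrete, the sketch below shows one plausible realization: per-clip statistics of dense optical-flow magnitude as features, fed to an AdaBoost classifier with shallow decision trees as weak learners. This is not the authors' released code; it assumes OpenCV's Farneback flow and scikit-learn (version 1.2 or later), and the file names, feature set, and label construction are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): flow-variance features per clip,
# then AdaBoost with decision-tree weak learners, as the abstract describes.
import cv2
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def flow_variance_features(frames):
    """Summarize a clip by statistics (mean, variance) of dense
    optical-flow magnitudes computed between consecutive frames."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).ravel())
        prev = gray
    mags = np.concatenate(mags)
    return np.array([mags.mean(), mags.var()])

# Hypothetical precomputed data: one feature row per subtitle-aligned clip,
# labeled 1 if its subtitle text contains a contrastive conjunction
# ("but", "neither", ...), else 0.
X = np.load("clip_features.npy")            # shape: (n_clips, n_features)
y = np.load("clip_contrastive_labels.npy")  # shape: (n_clips,), values 0/1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision-tree weak learners
    n_estimators=200,
)
clf.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

In this sketch, stratified splitting keeps the test set balanced across the two classes, mirroring the balanced evaluation reported in the abstract; depth-1 trees (decision stumps) are a common choice of weak learner for AdaBoost, though the paper's exact tree depth is not stated here.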