Invited talk: coupling deformable models and learning methods for nonverbal behavior analysis: applications to deception, multi-cultural studies and ASL

Authors:
Dimitris Metaxas
Affiliations:
CS, CBIM Center, Rutgers University
Venue:
ECCV'10 Proceedings of the 11th European conference on Trends and Topics in Computer Vision - Volume Part I
Year:
2010

Abstract

Based on recent advances in deformable model tracking theory, we have developed a novel system for real-time facial and gesture tracking and action recognition. In particular, our face tracker uses deformable statistical models that encode facial shape variation and local texture distribution to robustly track 79 facial landmarks, which correspond to facial components such as the eyes, eyebrows, nose, and mouth. The model initializes automatically, tolerates partial occlusions, and detects and recovers from lost track. Moreover, it handles head rotations of -90° to 90° in any direction by using manifold embedding methods. During online tracking, the model dynamically adapts to the facial shape of the current subject, and temporal filters stochastically smooth the target's position. Tracked landmarks are then used by our learning modules for feature extraction and event recognition. To speed up convergence to the optimal landmark configuration, the system employs multi-resolution model fitting. To further reduce computational complexity, we track landmarks in successive frames using a Sum of Squared Differences point tracker and run the relatively "expensive" face search step only periodically to prevent error accumulation. This scheme gives us a measure of tracking success (confidence) for each landmark, so that we can detect early on if we are beginning to drift from the target, in which case we immediately invoke the deformable fitting algorithm to self-correct the result. Similarly, we have developed a skin blob tracker that tracks the orientation, position, velocity and area of head and hand blobs; it is automatically initialized with a generic skin color model and dynamically learns the specific subject's color distribution online for adaptive tracking. Detected blobs are filtered online, both in terms of shape and motion, using eigenspace analysis and temporal dynamical models to prune false detections.
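The frame-to-frame step described above (cheap Sum of Squared Differences tracking, with the matching score doubling as a per-landmark confidence that triggers a full refit) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's implementation: the function names `patch`, `ssd`, `track_point` and `needs_refit`, the window sizes, and the fixed threshold are all hypothetical, and frames are assumed to be grayscale images stored as nested lists.

```python
def patch(img, cx, cy, half):
    """Return the (2*half+1) x (2*half+1) window centred on (cx, cy) as a flat list."""
    return [img[y][x]
            for y in range(cy - half, cy + half + 1)
            for x in range(cx - half, cx + half + 1)]


def ssd(a, b):
    """Sum of squared differences between two equally sized flat patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def track_point(prev, curr, x, y, half=2, radius=3):
    """Find where the landmark at (x, y) in `prev` moved to in `curr`.

    Exhaustively searches a (2*radius+1)^2 neighbourhood and returns
    (new_x, new_y, best_ssd). Assumes (x, y) lies at least `half` pixels
    inside the border of both frames.
    """
    h, w = len(curr), len(curr[0])
    template = patch(prev, x, y, half)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            # Skip candidates whose window would fall outside the frame.
            if not (half <= nx < w - half and half <= ny < h - half):
                continue
            score = ssd(template, patch(curr, nx, ny, half))
            if best is None or score < best[2]:
                best = (nx, ny, score)
    return best


def needs_refit(score, threshold):
    """Hypothetical drift test: a high SSD means low confidence in the match."""
    return score > threshold
```

In a loop over frames, a landmark whose `needs_refit` check fires would be handed back to the (more expensive) deformable fitting step rather than trusted; how the real system sets its confidence threshold is not stated in the abstract.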
We apply this framework to three different recognition applications. First, we use the tracked facial landmarks to crop the face region and extract appearance features, which are used to learn models that detect universal facial expressions (i.e., sadness, anger, fear, disgust, happiness and surprise). In particular, our method utilizes the relative intensity ordering of facial expressions (i.e., neutral, onset, apex, offset) found in the training set to learn a ranking model (RankBoost) for recognition and intensity estimation, which improves our average recognition rate (87.5% on the CMU benchmark database). Second, we use the tracked landmarks and blobs to compute derived features (e.g., features characterizing posture openness, asymmetrical facial expressions, etc.) and to recognize gestures (e.g., head touching, hands together, eye blinking, etc.). Using these features with discriminative learning methods, we train subject-specific models to detect whether subjects from various cultures are being deceptive in an interview scenario about a mock crime (12 responses per subject), as well as to identify cultural gestures. Using leave-one-out cross-validation (LOOCV), we achieved an average deception detection accuracy (percentage of correctly tagged responses) of 81.6% for 147 subjects. Third, we apply our tracking and learning methods to track signers of American Sign Language (ASL) and recognize gestures and expressions that have grammatical meaning. In particular, by tracking eyebrow movements, eye aperture changes, head tilts, head nods and head rotations, we can recognize wh-question markers, topic markers, and negation, using generative temporal models and spectral embedding methods to reduce feature dimensions and uncover the manifold separation (average accuracy of 84% in continuous sequences).
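To make the evaluation protocol for the deception study concrete, the following minimal sketch shows leave-one-out cross-validation applied per subject and then averaged. Several things here are assumptions rather than details from the abstract: the nearest-centroid classifier is only a stand-in for the unspecified discriminative learner, the helper names are hypothetical, and averaging per-subject accuracies (rather than pooling all responses) is one plausible reading of "average accuracy".

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (an assumption): store the mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_centroid_predict(model, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def loocv_accuracy(X, y):
    """Hold out each response once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(train_X, train_y)
        correct += nearest_centroid_predict(model, X[i]) == y[i]
    return correct / len(X)


def average_accuracy(subjects):
    """Mean of per-subject LOOCV accuracies; `subjects` is a list of (X, y) pairs."""
    scores = [loocv_accuracy(X, y) for X, y in subjects]
    return sum(scores) / len(scores)
```

With 12 responses per subject, each subject-specific model in this scheme would be trained 12 times on 11 responses and tested on the remaining one.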