Complex human activities occurring in videos can be defined in terms of temporal configurations of primitive actions. Prior work typically hand-picks the primitives, their total number, and their temporal relations (e.g., allowing only followed-by), and then only estimates their relative significance for activity recognition. We advance prior work by learning which activity parts and which spatiotemporal relations should be captured to represent the activity, and how relevant they are for enabling efficient inference in realistic videos. We represent videos by spatiotemporal graphs, where nodes correspond to multiscale video segments, and edges capture their hierarchical, temporal, and spatial relationships. Access to video segments is provided by our new multiscale segmenter. Given a set of training spatiotemporal graphs, we learn their archetype graph, along with pdfs associated with model nodes and edges. The model adaptively learns, from data, the relevant video segments and their relations, addressing the "what" and "how." Inference and learning are formulated within the same framework - that of a robust, least-squares optimization - which is invariant to arbitrary permutations of nodes in spatiotemporal graphs. The model is used for parsing new videos in terms of detecting and localizing relevant activity parts. We outperform the state of the art on the benchmark Olympic and UT human-interaction datasets, under a favorable complexity-vs.-accuracy trade-off.
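To make the representation concrete, here is a minimal sketch of how a video could be encoded as a spatiotemporal graph of multiscale segments with typed edges. The class and field names are illustrative assumptions, not the paper's implementation, and the segment attributes are reduced to temporal extent and scale for brevity:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """A video segment produced by a multiscale segmenter (hypothetical fields)."""
    seg_id: int
    scale: int     # level in the multiscale segmentation hierarchy
    t_start: int   # first frame covered by the segment
    t_end: int     # last frame covered by the segment

class SpatioTemporalGraph:
    """Nodes are video segments; edges carry one of three relation types."""
    RELATIONS = {"hierarchical", "temporal", "spatial"}

    def __init__(self):
        self.nodes = {}   # seg_id -> Segment
        self.edges = []   # (src_id, dst_id, relation)

    def add_segment(self, seg):
        self.nodes[seg.seg_id] = seg

    def add_relation(self, src_id, dst_id, relation):
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.append((src_id, dst_id, relation))

    def relations(self, relation):
        """Return all edges of the given type."""
        return [e for e in self.edges if e[2] == relation]

# Example: a coarse segment spanning frames 0-30, refined into two
# finer segments that follow each other in time.
g = SpatioTemporalGraph()
coarse = Segment(0, scale=0, t_start=0, t_end=30)
fine_a = Segment(1, scale=1, t_start=0, t_end=15)
fine_b = Segment(2, scale=1, t_start=15, t_end=30)
for s in (coarse, fine_a, fine_b):
    g.add_segment(s)
g.add_relation(0, 1, "hierarchical")  # coarse contains fine_a
g.add_relation(0, 2, "hierarchical")  # coarse contains fine_b
g.add_relation(1, 2, "temporal")      # fine_a is followed by fine_b
```

Learning the archetype graph from a set of such training graphs, and the permutation-invariant least-squares matching used for inference, are beyond this sketch; the structure above only illustrates the input representation.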