In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that automatically discovers discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign the sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models the durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is essential for processing very large numbers of videos efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.
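The exact dynamic-programming inference mentioned above can be illustrated with a minimal sketch. The variable-duration (explicit-duration) HMM admits a Viterbi-style recursion over segment end times and durations; the code below is a generic hidden semi-Markov decoder written for illustration, not the authors' implementation, and all function and variable names are assumptions. It scores each candidate segment with cumulative per-frame log-likelihoods so the inner segment score is an O(1) lookup.

```python
import numpy as np

def hsmm_viterbi(log_emit, log_trans, log_dur, max_dur):
    """Exact MAP segmentation for an explicit-duration HMM via dynamic
    programming (illustrative sketch, not the paper's implementation).

    log_emit:  (T, S) per-frame log-likelihood of each state
    log_trans: (S, S) log transition probabilities between segments
    log_dur:   (S, max_dur) log probability a state lasts d+1 frames
    Returns the per-frame state sequence maximizing the joint score.
    """
    T, S = log_emit.shape
    # Prefix sums let us score any segment [t-d, t) in O(1).
    cum = np.vstack([np.zeros(S), np.cumsum(log_emit, axis=0)])
    best = np.full((T + 1, S), -np.inf)        # best score ending at frame t in state s
    back = np.zeros((T + 1, S, 2), dtype=int)  # backpointer: (previous state, duration)
    best[0, :] = 0.0                           # uniform initial-state prior (assumption)
    for t in range(1, T + 1):
        for s in range(S):
            for d in range(1, min(max_dur, t) + 1):
                seg = cum[t, s] - cum[t - d, s] + log_dur[s, d - 1]
                if t - d == 0:
                    cand, prev = seg, s        # first segment: no incoming transition
                else:
                    prev = int(np.argmax(best[t - d] + log_trans[:, s]))
                    cand = best[t - d, prev] + log_trans[prev, s] + seg
                if cand > best[t, s]:
                    best[t, s] = cand
                    back[t, s] = (prev, d)
    # Backtrace the optimal segmentation into a per-frame state path.
    s, t, path = int(np.argmax(best[T])), T, []
    while t > 0:
        prev, d = back[t, s]
        path = [s] * d + path
        t -= d
        s = prev
    return path
```

With uniform transitions and durations, the decoder recovers the segmentation implied by the emissions: frames whose likelihoods favor one state are grouped into a single segment rather than flickering between states, since every extra segment pays a transition and duration cost.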