In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that automatically discovers discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign the sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models the durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is essential for processing very large numbers of videos efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.
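The exact dynamic-programming inference mentioned above can be illustrated with a minimal sketch. The variable-duration (explicit-duration) HMM admits a Viterbi-style recursion over segment end times and durations; the code below is a generic hidden semi-Markov decoder written for illustration, not the authors' implementation, and all function and variable names are assumptions. It scores each candidate segment with cumulative per-frame log-likelihoods so the inner segment score is an O(1) lookup.

```python
import numpy as np

def hsmm_viterbi(log_emit, log_trans, log_dur, max_dur):
    """Exact MAP segmentation for an explicit-duration HMM via dynamic
    programming (illustrative sketch, not the paper's implementation).

    log_emit:  (T, S) per-frame log-likelihood of each state
    log_trans: (S, S) log transition probabilities between segments
    log_dur:   (S, max_dur) log probability a state lasts d+1 frames
    Returns the per-frame state sequence maximizing the joint score.
    """
    T, S = log_emit.shape
    # Prefix sums let us score any segment [t-d, t) in O(1).
    cum = np.vstack([np.zeros(S), np.cumsum(log_emit, axis=0)])
    best = np.full((T + 1, S), -np.inf)        # best score ending at frame t in state s
    back = np.zeros((T + 1, S, 2), dtype=int)  # backpointer: (previous state, duration)
    best[0, :] = 0.0                           # uniform initial-state prior (assumption)
    for t in range(1, T + 1):
        for s in range(S):
            for d in range(1, min(max_dur, t) + 1):
                seg = cum[t, s] - cum[t - d, s] + log_dur[s, d - 1]
                if t - d == 0:
                    cand, prev = seg, s        # first segment: no incoming transition
                else:
                    prev = int(np.argmax(best[t - d] + log_trans[:, s]))
                    cand = best[t - d, prev] + log_trans[prev, s] + seg
                if cand > best[t, s]:
                    best[t, s] = cand
                    back[t, s] = (prev, d)
    # Backtrace the optimal segmentation into a per-frame state path.
    s, t, path = int(np.argmax(best[T])), T, []
    while t > 0:
        prev, d = back[t, s]
        path = [s] * d + path
        t -= d
        s = prev
    return path
```

With uniform transitions and durations, the decoder recovers the segmentation implied by the emissions: frames whose likelihoods favor one state are grouped into a single segment rather than flickering between states, since every extra segment pays a transition and duration cost.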