Convolutional learning of spatio-temporal features

Authors:
Graham W. Taylor;Rob Fergus;Yann LeCun;Christoph Bregler
Affiliations:
Courant Institute of Mathematical Sciences, New York University, New York;Courant Institute of Mathematical Sciences, New York University, New York;Courant Institute of Mathematical Sciences, New York University, New York;Courant Institute of Mathematical Sciences, New York University, New York
Venue:
ECCV'10 Proceedings of the 11th European Conference on Computer Vision: Part VI
Year:
2010

Abstract

We address the problem of learning good features for understanding video data. We introduce a model that learns latent representations of image sequences from pairs of successive images. The convolutional architecture of our model allows it to scale to realistic image sizes whilst using a compact parametrization. In experiments on the NORB dataset, we show our model extracts latent "flow fields" which correspond to the transformation between the pair of input frames. We also use our model to extract low-level motion features in a multi-stage architecture for action recognition, demonstrating competitive performance on both the KTH and Hollywood2 datasets.