Comparing evaluation protocols on the KTH dataset

  • Authors:
  • Zan Gao; Ming-Yu Chen; Alexander G. Hauptmann; Anni Cai

  • Affiliations:
  • School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, P.R. China; School of Computer Science, Carnegie Mellon University, PA; School of Computer Science, Carnegie Mellon University, PA; School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, P.R. China

  • Venue:
  • HBU'10: Proceedings of the First International Conference on Human Behavior Understanding
  • Year:
  • 2010

Abstract

Human action recognition has become a hot research topic, and many algorithms have been proposed. Most researchers evaluate their methods on the KTH dataset, but there is no unified standard for how to evaluate algorithms on this dataset. Different researchers have employed different test setups, so comparisons are neither accurate, fair, nor complete. To quantify how much difference the experimental setup makes, we take our own spatio-temporal MoSIFT feature as an example and assess its performance on the KTH dataset under different test scenarios and different partitionings of the data. In all experiments, a support vector machine (SVM) with a chi-square kernel is used. First, we evaluate how performance changes with the vocabulary size of the codebook and then fix a suitable vocabulary size. Next, we train models on different training-set partitions and test them on the corresponding held-out test sets. Experiments show that the best performance of MoSIFT reaches 96.33% on the KTH dataset. When different n-fold cross-validation methods are used, the results can differ by up to 10.67%. When different dataset segmentations are used (such as KTH1 and KTH2), the difference can be up to 5.8% absolute. In addition, performance changes dramatically when different scenarios are used in the training and test sets: training on KTH1 S1+S2+S3+S4 and testing on the KTH1 S1 and S3 scenarios yields 97.33% and 89.33%, respectively. This paper shows how different test configurations can skew results, even on a standard dataset. The recommendation is to use simple leave-one-out cross-validation as the most easily replicated, clear-cut partitioning.
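The recommended protocol can be made concrete with a small sketch. The snippet below is not the authors' code: it assumes one bag-of-words histogram of quantized MoSIFT descriptors per clip (random placeholders stand in for real features), uses scikit-learn's chi-square kernel as a precomputed SVM kernel, and runs leave-one-subject-out evaluation over the 25 KTH performers; all variable names and the parameter values are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Hypothetical inputs: X holds one bag-of-words histogram of quantized
# MoSIFT descriptors per clip, y the action label (6 KTH classes), and
# subjects the performer ID (1..25 on KTH) for each clip.
rng = np.random.default_rng(0)
n_clips, vocab_size = 599, 1000               # vocabulary size fixed beforehand
X = rng.random((n_clips, vocab_size))
X /= X.sum(axis=1, keepdims=True)             # L1-normalized histograms
y = rng.integers(0, 6, n_clips)
subjects = rng.integers(1, 26, n_clips)

# Leave-one-subject-out: train on 24 performers, test on the held-out one.
accuracies = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    # Precompute the chi-square kernel so SVC can consume it directly.
    K_train = chi2_kernel(X[train_idx], X[train_idx])
    K_test = chi2_kernel(X[test_idx], X[train_idx])
    clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y[train_idx])
    accuracies.append(clf.score(K_test, y[test_idx]))

print(f"mean leave-one-subject-out accuracy: {np.mean(accuracies):.4f}")
```

Because the held-out split is defined by performer identity rather than by a random fold, this partitioning is fully reproducible across papers, which is the property the abstract argues for.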