On cross-validation and stacking: building seemingly predictive models on random data

  • Authors: Claudia Perlich; Grzegorz Świrszcz
  • Affiliations: Media6, New York, NY; IBM T.J. Watson Research Center, Yorktown Heights, NY
  • Venue: ACM SIGKDD Explorations Newsletter
  • Year: 2011

Abstract

On a number of occasions when using cross-validation (CV) for classification/probability estimation, we have observed surprisingly low AUCs on real data with very few positive examples. AUC, the area under the ROC curve, measures ranking ability and corresponds to the probability that a positive example receives a higher model score than a negative example. Intuition suggests that no reasonable methodology should ever produce a model with an AUC significantly below 0.5. The focus of this paper is not on the estimator properties of CV (bias/variance/significance), but rather on the properties of the 'holdout' predictions from which the CV performance of a model is calculated. We show that CV creates predictions with an 'inverse' ranking, with AUC well below 0.25, using features that were initially entirely unpredictive and models that can only perform monotonic transformations. In the extreme, combining CV with bagging (repeated averaging of out-of-sample predictions) generates 'holdout' predictions with perfectly opposite rankings on random data. While this would raise immediate suspicion upon inspection, we would like to caution the data mining community against using CV predictions for stacking or in currently popular ensemble methods: such methods can reverse the predictions by assigning negative weights and in the end produce a model that appears to have close to perfect predictive power while in reality the data were random.
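
The effect described in the abstract can be illustrated with a minimal sketch (not the authors' code). The assumptions here are: labels are purely random with very few positives, and the "model" is the simplest monotonic transformation possible, predicting the positive rate observed in the training folds. Pooling the out-of-fold scores then yields an AUC well below 0.5, and averaging over many repeated random fold assignments drives it toward 0.

```python
# Illustrative sketch: pooled cross-validation predictions on random data.
# Assumed tools: scikit-learn's KFold and roc_auc_score; the "model" simply
# predicts the positive base rate of the training folds (a constant per fold).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, n_pos = 1000, 10                                  # very few positive examples
y = np.zeros(n, dtype=int)
y[rng.choice(n, size=n_pos, replace=False)] = 1      # random labels, no signal

def pooled_cv_scores(y, n_splits=10, seed=0):
    """Out-of-fold scores: each fold's score is the training-fold base rate."""
    scores = np.empty(len(y))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(y):
        scores[test_idx] = y[train_idx].mean()       # constant prediction per fold
    return scores

# Single CV run: positives sit in folds whose training folds contain fewer
# positives, so they receive systematically lower scores -> AUC well below 0.5.
print("single CV AUC:", roc_auc_score(y, pooled_cv_scores(y)))

# "Bagging" the CV predictions: averaging out-of-fold scores over many random
# fold assignments pushes the ranking toward a perfect inversion (AUC near 0).
avg = np.mean([pooled_cv_scores(y, seed=s) for s in range(200)], axis=0)
print("averaged CV AUC:", roc_auc_score(y, avg))
```

In this sketch the inverse ranking arises purely from the fold mechanics: whenever a positive example is held out, the training folds contain one fewer positive, so its out-of-fold score is lower on average than that of the negatives; a stacking layer given such pooled predictions can exploit the inverted ranking with a negative weight.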