Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.
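The effect described above is easy to reproduce. The following is a minimal sketch, not the paper's experimental protocol, using scikit-learn on pure-noise data (where the true accuracy of any classifier is 0.5). Reporting the best cross-validation score found during hyper-parameter search is optimistically biased, because the search over-fits the finite-sample model selection criterion; a nested cross-validation, in which the outer folds never influence the hyper-parameter choice, does not share this bias. The data dimensions and hyper-parameter grid are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 20)             # pure noise features
y = rng.randint(0, 2, 100)         # random labels: true accuracy is 0.5

# A moderately large grid gives the search room to over-fit the
# cross-validation estimate, which has non-negligible variance.
param_grid = {"C": np.logspace(-2, 2, 10), "gamma": np.logspace(-3, 1, 10)}
search = GridSearchCV(SVC(), param_grid, cv=5)

# Naive protocol: tune and report performance on the same folds.
search.fit(X, y)
print("best CV accuracy (optimistically biased):", search.best_score_)

# Nested protocol: model selection is repeated inside each outer
# training fold, so the outer estimate is (nearly) unbiased.
nested_scores = cross_val_score(search, X, y, cv=5)
print("nested CV accuracy:", nested_scores.mean())
```

On runs of this kind the naive "best CV" figure typically lands well above 0.5 while the nested estimate stays close to it; the gap is exactly the selection bias that makes the naive protocol unreliable for comparing learning algorithms.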