Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation

Authors:
Juan D. Rodriguez;Aritz Perez;Jose A. Lozano
Affiliations:
University of the Basque Country, San Sebastian;University of the Basque Country, San Sebastian-Donostia;University of the Basque Country, San Sebastian-Donostia
Venue:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Year:
2010

Citing 0
Cited 11

A predication survival model for colorectal cancer
AMERICAN-MATH'11/CEA'11 Proceedings of the 2011 American conference on applied mathematics and the 5th WSEAS international conference on Computer engineering and applications

Experimental comparison of resampling methods in a multi-agent system to assist with property valuation
KES-AMSTA'11 Proceedings of the 5th KES international conference on Agent and multi-agent systems: technologies and applications

Texture and color analysis for the automatic classification of the eye lipid layer
IWANN'11 Proceedings of the 11th international conference on Artificial neural networks conference on Advances in computational intelligence - Volume Part II

Approaching Sentiment Analysis by using semi-supervised learning of multi-dimensional classifiers
Neurocomputing

Perceptual relativity-based local hyperplane classification
Neurocomputing

Sensitivity analysis with cross-validation for feature selection and manifold learning
ISNN'12 Proceedings of the 9th international conference on Advances in Neural Networks - Volume Part I

A general framework for the statistical analysis of the sources of variance for classification error estimators
Pattern Recognition

Supervised pre-processing approaches in multiple class variables classification for fish recruitment forecasting
Environmental Modelling & Software

Hybrid e-regression and validation soft computing techniques: The case of wood dielectric loss factor
Neurocomputing

Automatic classification of the interferential tear film lipid layer using colour texture analysis
Computer Methods and Programs in Biomedicine

VILO: a rapid learning nearest-neighbor classifier for malware triage
Journal in Computer Virology

Quantified Score

Hi-index	0.14

Abstract

In the machine learning field, the performance of a classifier is usually measured in terms of prediction error. In most real-world problems, the error cannot be calculated exactly and must be estimated. Therefore, it is important to choose an appropriate estimator of the error. This paper analyzes the statistical properties, bias and variance, of the k-fold cross-validation classification error estimator (k-cv). Our main contribution is a novel theoretical decomposition of the variance of the k-cv considering its sources of variance: sensitivity to changes in the training set and sensitivity to changes in the folds. The paper also compares the bias and variance of the estimator for different values of k. The experimental study has been performed in artificial domains because they allow the exact computation of the quantities involved and we can rigorously specify the conditions of experimentation. The experimentation has been performed for two classifiers (naive Bayes and nearest neighbor), different numbers of folds, sample sizes, and training sets coming from assorted probability distributions. We conclude by including some practical recommendations on the use of k-fold cross-validation.
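The two sources of variance named in the abstract can be illustrated with a small simulation. The sketch below is not the paper's decomposition or its experimental setup. It only applies the law of total variance to repeated k-cv estimates: the variance of the per-training-set means reflects sensitivity to the training set, and the mean of the per-training-set variances reflects sensitivity to the folds. The two-class Gaussian domain, the 1-nearest-neighbor classifier written in NumPy, and every function and parameter name are assumptions chosen for illustration.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    # Squared Euclidean distance from every test point to every training point.
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Each test point takes the label of its nearest training point.
    return y_train[d.argmin(axis=1)]

def kfold_cv_error(X, y, k, rng):
    # k-cv error estimate for one random partition of the sample into k folds.
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    errors = 0
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = one_nn_predict(X[train_idx], y[train_idx], X[test_idx])
        errors += int((pred != y[test_idx]).sum())
    return errors / n

def sample_dataset(n, rng):
    # Hypothetical artificial domain: two Gaussian classes in two dimensions.
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
    return X, y

def variance_components(n=50, k=10, n_sets=30, n_partitions=20, seed=0):
    rng = np.random.default_rng(seed)
    est = np.empty((n_sets, n_partitions))
    for s in range(n_sets):
        X, y = sample_dataset(n, rng)               # a new training set
        for p in range(n_partitions):
            est[s, p] = kfold_cv_error(X, y, k, rng)  # a new fold partition
    between_sets = est.mean(axis=1).var()   # sensitivity to the training set
    within_sets = est.var(axis=1).mean()    # sensitivity to the folds
    return est, between_sets, within_sets

est, between_sets, within_sets = variance_components()
print("variance across training sets:", between_sets)
print("variance across fold partitions:", within_sets)
print("total variance of the estimates:", est.var())
```

Because every training set is paired with the same number of partitions and the variances use the population form (NumPy's default `ddof=0`), the two components sum to the total variance of the simulated estimates. Changing `k` or `n` in `variance_components` gives a rough feel for how each component responds, under these assumptions only.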