Prediction error estimation: a comparison of resampling methods

Authors:
Annette M. Molinaro; Richard Simon; Ruth M. Pfeiffer
Affiliations:
Biostatistics Branch, Division of Cancer Epidemiology and Genetics, NCI, NIH Rockville, MD 20852 USA; Biometric Research Branch, Division of Cancer Treatment and Diagnostics, NCI, NIH Rockville, MD 20852 USA; Biostatistics Branch, Division of Cancer Epidemiology and Genetics, NCI, NIH Rockville, MD 20852 USA
Venue:
Bioinformatics
Year:
2005



Abstract

Motivation: In genomic studies, thousands of features are collected on relatively few samples. One of the goals of these studies is to build classifiers to predict the outcome of future observations. There are three inherent steps to this process: feature selection, model selection and prediction assessment. With a focus on prediction assessment, we compare several methods for estimating the 'true' prediction error of a prediction model in the presence of feature selection.

Results: For small studies where features are selected from thousands of candidates, the resubstitution and simple split-sample estimates are seriously biased. In these small samples, leave-one-out cross-validation (LOOCV), 10-fold cross-validation (CV) and the .632+ bootstrap have the smallest bias for diagonal discriminant analysis, nearest neighbor and classification trees. LOOCV and 10-fold CV have the smallest bias for linear discriminant analysis. Additionally, LOOCV, 5- and 10-fold CV, and the .632+ bootstrap have the lowest mean square error. The .632+ bootstrap is quite biased in small sample sizes with strong signal-to-noise ratios. Differences in performance among resampling methods are reduced as the number of available specimens increases.

Contact: annette.molinaro@yale.edu

Supplementary Information: A complete compilation of results and R code for simulations and analyses are available in Molinaro et al. (2005) (http://linus.nci.nih.gov/brb/TechReport.htm).
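The contrast the Results draw between resubstitution and cross-validation in the presence of feature selection can be illustrated with a small simulation. The sketch below is not the authors' R code. It is a minimal Python illustration under stated assumptions: the data are pure noise (so the true error of any classifier is 50% with balanced classes), the classifier is a simple nearest-centroid rule standing in for the methods named in the abstract, and all function names (`make_noise_data`, `select_features`, `fit_predict`, `resubstitution_error`, `cv_error`) are hypothetical. The key point is that feature selection is repeated inside every training fold, so held-out samples never influence which features are chosen.

```python
import random


def make_noise_data(n, p, seed=0):
    """Simulate n samples with p pure-noise features and balanced labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    rng.shuffle(y)
    return X, y


def class_means(X, y, features):
    """Per-class mean of each listed feature (the class centroids)."""
    means = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return means


def select_features(X, y, n_keep):
    """Keep the n_keep features with the largest absolute class-mean difference."""
    all_features = list(range(len(X[0])))
    m = class_means(X, y, all_features)
    scores = [abs(m[0][j] - m[1][j]) for j in all_features]
    return sorted(all_features, key=lambda j: scores[j], reverse=True)[:n_keep]


def fit_predict(X_train, y_train, X_test, n_keep):
    """Select features on the training data only, then classify by nearest centroid."""
    feats = select_features(X_train, y_train, n_keep)
    cents = class_means(X_train, y_train, feats)
    preds = []
    for x in X_test:
        dist = {
            lab: sum((x[j] - c) ** 2 for j, c in zip(feats, cents[lab]))
            for lab in (0, 1)
        }
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds


def resubstitution_error(X, y, n_keep):
    """Train and test on the same samples (optimistically biased)."""
    preds = fit_predict(X, y, X, n_keep)
    return sum(p != t for p, t in zip(preds, y)) / len(y)


def cv_error(X, y, n_keep, n_folds, seed=0):
    """k-fold CV with feature selection redone in each fold; n_folds == len(X) gives LOOCV."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    wrong = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
            n_keep,
        )
        wrong += sum(p != y[i] for p, i in zip(preds, test_idx))
    return wrong / len(X)


# Small-sample, high-dimensional setting: 40 samples, 1000 noise features.
X, y = make_noise_data(n=40, p=1000, seed=1)
resub = resubstitution_error(X, y, n_keep=10)
cv10 = cv_error(X, y, n_keep=10, n_folds=10)
loocv = cv_error(X, y, n_keep=10, n_folds=len(X))
print(f"resubstitution: {resub:.2f}  10-fold CV: {cv10:.2f}  LOOCV: {loocv:.2f}")
```

Because the features carry no signal, an honest estimate should sit near 0.5. The resubstitution estimate is typically far lower, since the same samples are used both to pick the features and to score the classifier. The .632+ bootstrap and split-sample estimators discussed in the abstract are not implemented in this sketch.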