Is cross-validation valid for small-sample microarray classification?

Authors:
Ulisses M. Braga-Neto; Edward R. Dougherty
Affiliations:
Section of Clinical Cancer Genetics; Department of Pathology, University of Texas MD Anderson Cancer Center, Houston, TX, USA
Venue:
Bioinformatics
Year:
2004



Abstract

Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.

Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules (linear discriminant analysis, 3-nearest-neighbor and CART decision trees), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics of this deviation distribution have been computed: mean (estimator bias), variance (estimator precision), root-mean-square error (combining bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).

Availability and Supplementary information: A companion web site can be accessed at http://ee.tamu.edu/~edward/cv_paper. It contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.
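The comparison described in the abstract (estimated error minus true error, summarized by its mean, variance and root-mean-square) can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a hypothetical synthetic model (two equally likely Gaussian classes with identity covariance), uses only the linear discriminant rule, and picks leave-one-out as the cross-validation variant and the zero and .632 bootstrap as the bootstrap variants. The function names, sample size, dimension, class separation `delta` and the tiny ridge term are all illustrative choices. Under this model the true error of a linear rule has a closed form, so no separate test set is needed.

```python
import math
import numpy as np

def fit_lda(X, y):
    """Fit LDA with pooled covariance; returns (w, b) for the rule w.x + b > 0."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    S = S + 1e-6 * np.eye(X.shape[1])  # tiny ridge for numerical safety (assumption)
    w = np.linalg.solve(S, m1 - m0)
    return w, -0.5 * w @ (m0 + m1)

def error_rate(w, b, X, y):
    """Fraction of points in (X, y) misclassified by the linear rule."""
    return float(np.mean((X @ w + b > 0).astype(int) != y))

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def true_error(w, b, mu0, mu1):
    """Exact error of a linear rule for equally likely N(mu0, I) and N(mu1, I)."""
    s = np.linalg.norm(w)
    e0 = norm_cdf((w @ mu0 + b) / s)    # class 0 labelled as 1
    e1 = norm_cdf(-(w @ mu1 + b) / s)   # class 1 labelled as 0
    return 0.5 * (e0 + e1)

def loo_cv(X, y):
    """Leave-one-out cross-validation error estimate."""
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        w, b = fit_lda(X[keep], y[keep])
        pred = int(X[i] @ w + b > 0)
        wrong += int(pred != y[i])
    return wrong / n

def bootstrap_zero(X, y, rng, B=50):
    """Zero bootstrap: train on a resample, test on the points left out of it."""
    n = len(y)
    wrong = tested = 0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), idx)
        # Skip degenerate resamples (no held-out points or a nearly empty class).
        if len(oob) == 0 or min(np.sum(y[idx] == 0), np.sum(y[idx] == 1)) < 2:
            continue
        w, b = fit_lda(X[idx], y[idx])
        wrong += np.sum((X[oob] @ w + b > 0).astype(int) != y[oob])
        tested += len(oob)
    return wrong / tested if tested else float("nan")

def simulate(reps=100, n_per_class=10, d=2, delta=1.5, seed=0):
    """Return deviation statistics per estimator and the true errors per sample."""
    rng = np.random.default_rng(seed)
    mu0, mu1 = np.zeros(d), np.zeros(d)
    mu1[0] = delta
    y = np.repeat([0, 1], n_per_class)
    dev = {"resub": [], "loo": [], "boot0": [], "b632": []}
    true_errs = []
    for _ in range(reps):
        X = rng.standard_normal((2 * n_per_class, d)) + np.where(y[:, None] == 1, mu1, mu0)
        w, b = fit_lda(X, y)
        t = true_error(w, b, mu0, mu1)
        resub = error_rate(w, b, X, y)
        boot0 = bootstrap_zero(X, y, rng)
        est = {"resub": resub, "loo": loo_cv(X, y), "boot0": boot0,
               "b632": 0.368 * resub + 0.632 * boot0}
        for k, v in est.items():
            dev[k].append(v - t)  # deviation = estimated error minus true error
        true_errs.append(t)
    stats = {}
    for k, v in dev.items():
        v = np.asarray(v)
        stats[k] = {"bias": v.mean(), "var": v.var(), "rms": np.sqrt(np.mean(v ** 2))}
    return stats, np.asarray(true_errs)

stats, true_errs = simulate()
for name, s in stats.items():
    print(f"{name:6s} bias={s['bias']:+.3f} var={s['var']:.4f} rms={s['rms']:.3f}")
```

Each row printed corresponds to one estimator's deviation distribution over the simulated samples. Changing `n_per_class` is a simple way to see how the spread of the deviations depends on sample size in this toy setting. The numbers it produces say nothing about the paper's own tables.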