Gene-expression-based classifiers suffer from the small number of microarrays usually available for classifier design. Hence, one is confronted with the dual problem of designing a classifier and estimating its error with only a small sample. Permutation testing has been recommended to assess the dependency of a designed classifier on the specific data set: one randomly permutes the labels of the data points, estimates the error of the classifier designed for each permutation, and then finds the p value of the error for the actual labeling relative to the population of errors for the random labelings. This paper addresses whether this p value is informative. It provides both analytic and simulation results showing that the permutation p value is, up to a very small deviation, a function of the error estimate. Moreover, although the p value is a monotonically increasing function of the error estimate, that function increases very slowly over the range of errors where the majority of p values lie, so inverting it is problematic. Hence, the conclusion is that the p value is less informative than the error estimate. This result demonstrates that random labeling provides no further insight into the accuracy of the classifier or the precision of the error estimate: we have no knowledge beyond the error estimate itself and the various distribution-free, classifier-specific bounds developed for it.
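The permutation-testing procedure described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the classifier (nearest centroid), the resubstitution error estimate, and the dataset shapes are all assumptions chosen to keep the sketch self-contained.

```python
import random
from statistics import mean

def nearest_centroid_error(X, y):
    """Design a nearest-centroid classifier on (X, y) and return its
    resubstitution (training-set) error estimate.

    This stands in for whatever classifier rule and error estimator
    one actually uses; the permutation procedure is the same.
    """
    classes = sorted(set(y))
    centroids = {
        c: [mean(col) for col in zip(*(x for x, lab in zip(X, y) if lab == c))]
        for c in classes
    }
    def predict(x):
        return min(classes,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))
    return sum(predict(x) != lab for x, lab in zip(X, y)) / len(y)

def permutation_p_value(X, y, n_perm=1000, seed=0):
    """Return (actual_error, p_value).

    The p value is the fraction of random labelings whose designed-classifier
    error is at most the error for the actual labeling (with the usual +1
    correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    actual = nearest_centroid_error(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)        # permute labels; class sizes are preserved
        if nearest_centroid_error(X, y_perm) <= actual:
            hits += 1
    return actual, (hits + 1) / (n_perm + 1)
```

Because every permuted labeling is scored by the same design-and-estimate procedure, the resulting p value is tied to the actual error estimate — which is the dependence the paper analyzes: a low error estimate on well-separated data yields a small p value, and vice versa.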