Gene-expression-based classifiers suffer from the small number of microarrays usually available for classifier design. Hence, one is confronted with the dual problem of designing a classifier and estimating its error with only a small sample. Permutation testing has been recommended to assess the dependency of a designed classifier on the specific data set: one randomly permutes the labels of the data points, estimates the error of the classifier designed for each permutation, and then finds the p value of the error for the actual labeling relative to the population of errors for the random labelings. This paper addresses whether this p value is informative. It provides both analytic and simulation results showing that the permutation p value is, up to a very small deviation, a function of the error estimate. Moreover, although the p value is a monotonically increasing function of the error estimate, that function increases very slowly over the range of errors where the majority of p values lie, so inverting it is problematic. Hence, the conclusion is that the p value is less informative than the error estimate. This result demonstrates that random labeling provides no further insight into the accuracy of the classifier or the precision of the error estimate: we have no knowledge beyond the error estimate itself and the various distribution-free, classifier-specific bounds developed for it.
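The permutation-testing procedure described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the classifier (nearest centroid), the resubstitution error estimate, and the dataset shapes are all assumptions chosen to keep the sketch self-contained.

```python
import random
from statistics import mean

def nearest_centroid_error(X, y):
    """Design a nearest-centroid classifier on (X, y) and return its
    resubstitution (training-set) error estimate.

    This stands in for whatever classifier rule and error estimator
    one actually uses; the permutation procedure is the same.
    """
    classes = sorted(set(y))
    centroids = {
        c: [mean(col) for col in zip(*(x for x, lab in zip(X, y) if lab == c))]
        for c in classes
    }
    def predict(x):
        return min(classes,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))
    return sum(predict(x) != lab for x, lab in zip(X, y)) / len(y)

def permutation_p_value(X, y, n_perm=1000, seed=0):
    """Return (actual_error, p_value).

    The p value is the fraction of random labelings whose designed-classifier
    error is at most the error for the actual labeling (with the usual +1
    correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    actual = nearest_centroid_error(X, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)        # permute labels; class sizes are preserved
        if nearest_centroid_error(X, y_perm) <= actual:
            hits += 1
    return actual, (hits + 1) / (n_perm + 1)
```

Because every permuted labeling is scored by the same design-and-estimate procedure, the resulting p value is tied to the actual error estimate — which is the dependence the paper analyzes: a low error estimate on well-separated data yields a small p value, and vice versa.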