We demonstrate a binary classification problem on which standard supervised learning algorithms such as linear and kernel SVM, naive Bayes, ridge regression, k-nearest neighbors, shrunken centroids, multilayer perceptrons and decision trees behave in an unusual way. On certain data sets they classify a randomly sampled training subset nearly perfectly, yet systematically perform worse than random guessing on cases unseen in training. We demonstrate this phenomenon in the classification of a natural data set of cancer genomics microarrays using cross-validation tests. Additionally, we generate a range of synthetic datasets, the outcomes of zero-sum games, for which we analyse this phenomenon in the i.i.d. setting. Furthermore, we propose and evaluate a remedy that yields promising results for classifying such data as well as normal datasets. We simply transform the classifier scores by an additional one-dimensional linear transformation, fitted, for instance, to maximize the classification accuracy of the outputs of an internal cross-validation on the training set. We also discuss the relevance to other fields such as learning theory, boosting, regularization, sample bias, and the application of kernels.
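The proposed remedy can be sketched in code. The snippet below is a minimal, hedged illustration, not the paper's exact experimental setup: it assumes a toy nearest-centroid scorer in place of the SVMs and other learners used in the paper, and it restricts the one-dimensional linear transformation to its simplest instance, a sign flip applied whenever internal leave-one-out cross-validation on the training set scores below chance.

```python
# Illustrative sketch of the score-flipping remedy (assumptions: a toy
# nearest-centroid scorer stands in for the paper's classifiers, and the
# 1-D linear transform is reduced to a sign flip chosen by internal
# leave-one-out cross-validation on the training set).

def centroid_scores(train_x, train_y, test_x):
    """Score = distance to the -1 class centroid minus distance to the
    +1 class centroid, so positive scores favour class +1."""
    pos = [x for x, y in zip(train_x, train_y) if y == +1]
    neg = [x for x, y in zip(train_x, train_y) if y == -1]
    cp = [sum(c) / len(pos) for c in zip(*pos)]
    cn = [sum(c) / len(neg) for c in zip(*neg)]

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    return [dist(x, cn) - dist(x, cp) for x in test_x]

def internal_cv_sign(train_x, train_y):
    """Leave-one-out CV on the training set: return -1 (flip the scores)
    if held-out points are classified below chance, else +1."""
    correct = 0
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        s = centroid_scores(rest_x, rest_y, [train_x[i]])[0]
        if (s > 0) == (train_y[i] > 0):
            correct += 1
    return +1 if correct >= len(train_x) / 2 else -1

def predict(train_x, train_y, test_x):
    """Predict labels after applying the CV-selected sign to the scores."""
    sign = internal_cv_sign(train_x, train_y)
    return [+1 if sign * s > 0 else -1
            for s in centroid_scores(train_x, train_y, test_x)]
```

On ordinary, well-separated data the internal cross-validation leaves the scores untouched; on a class-symmetric configuration (e.g. one point per standard basis vector in four dimensions, labelled +1, +1, -1, -1), every held-out point lands closer to the opposite class centroid, the internal CV accuracy drops to zero, and the transform flips the scores.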