Magical thinking in data mining: lessons from CoIL challenge 2000

  • Authors: Charles Elkan
  • Affiliations: University of California, San Diego, La Jolla, California
  • Venue: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year: 2001

Abstract

CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not properly appreciate the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
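
The second conclusion is worth a concrete illustration. Because a naive Bayesian classifier treats features as conditionally independent given the class, an interaction between two features is invisible to it unless that interaction is encoded explicitly as a derived feature. The sketch below is not the contest data or the winning entry's code; it is a minimal synthetic XOR example using scikit-learn's BernoulliNB, showing near-chance accuracy on the raw features and full recovery of the signal once indicator columns for the joint value of the two interacting features are added.

# Minimal, illustrative sketch (synthetic data, not the CoIL 2000 variables):
# naive Bayes cannot capture an XOR-style interaction from raw features alone,
# but a hand-crafted interaction encoding restores the signal.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))   # two binary features
y = X[:, 0] ^ X[:, 1]                    # label depends only on their interaction

# Raw features: each column alone is uninformative, so accuracy is near chance.
print("raw features:         ", BernoulliNB().fit(X, y).score(X, y))

# Derived interaction features: one indicator per joint value (00, 01, 10, 11).
joint = 2 * X[:, 0] + X[:, 1]
X_inter = np.column_stack([X, np.eye(4, dtype=int)[joint]])
print("with interaction cols:", BernoulliNB().fit(X_inter, y).score(X_inter, y))

The point of the sketch is not the XOR task itself but the workflow: the model stays simple, and the human analyst supplies the interaction as a derived feature, which is consistent with the paper's observation that correctly identified feature interactions, not a more complicated learning algorithm, separated the winning entry from the rest.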