CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next-best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not properly appreciate the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
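The following is a minimal sketch, not the contest code, illustrating two of the abstract's points on synthetic data: a naive Bayes baseline augmented with a hand-crafted interaction feature, and a rough significance check on the resulting accuracy difference. The dataset and the particular interaction feature are assumptions made for illustration only.

```python
# Sketch: naive Bayes with an engineered interaction feature, plus a
# back-of-the-envelope significance check (all data here is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=4000, n_features=10,
                           n_informative=4, random_state=0)

# Hypothetical interaction feature: the product of two raw features,
# standing in for a domain-specific interaction of the kind that
# separated the winning CoIL entry from the rest.
X_inter = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])

# Same random_state gives identical row splits for both feature sets.
Xa_tr, Xa_te, y_tr, y_te = train_test_split(X, y, random_state=0)
Xb_tr, Xb_te, _, _ = train_test_split(X_inter, y, random_state=0)

acc_plain = GaussianNB().fit(Xa_tr, y_tr).score(Xa_te, y_te)
acc_inter = GaussianNB().fit(Xb_tr, y_tr).score(Xb_te, y_te)

# Rough significance check: an accuracy difference smaller than about
# two standard errors of a single accuracy estimate is indistinguishable
# from noise -- the paper's warning about extensive model search.
n = len(y_te)
se = np.sqrt(acc_plain * (1 - acc_plain) / n)
print(f"plain NB:        {acc_plain:.3f}")
print(f"NB + interaction:{acc_inter:.3f}")
print(f"~2 std. errors:  {2 * se:.3f}")
```

On a test set of this size, differences of a percentage point or two fall within the noise band printed above, which is why an exhaustive model search on a dataset like CoIL's mostly rewards chance.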