Sparse data and the effect of overfitting avoidance in decision tree induction

Authors:
Cullen Schaffer
Affiliations:
Department of Computer Science, CUNY, Hunter College, New York, NY
Venue:
AAAI'92 Proceedings of the tenth national conference on Artificial intelligence
Year:
1992

Citing 7

Simplifying decision trees
International Journal of Man-Machine Studies - Special Issue: Knowledge Acquisition for Knowledge-based Systems. Part 5

Inferring decision trees using the minimum description length principle
Information and Computation

When does overfitting decrease prediction accuracy in induced decision trees and rule sets?
EWSL-91 Proceedings of the European working session on learning on Machine learning

On estimating probabilities in tree pruning
EWSL-91 Proceedings of the European working session on learning on Machine learning

Deconstructing the digit recognition problem
ML92 Proceedings of the ninth international workshop on Machine learning

Overfitting Avoidance as Bias
Machine Learning

An Empirical Comparison of Pruning Methods for Decision Tree Induction
Machine Learning

Cited 1

Don't care values in induction
Artificial Intelligence in Medicine

Abstract

Overfitting avoidance in induction has often been treated as if it statistically increases expected predictive accuracy. In fact, there is no statistical basis for believing it will have this effect. Overfitting avoidance is simply a form of bias and, as such, its effect on expected accuracy depends, not on statistics, but on the degree to which this bias is appropriate to a problem-generating domain. This paper identifies one important factor that affects the degree to which the bias of overfitting avoidance is appropriate--the abundance of training data relative to the complexity of the relationship to be induced--and shows empirically how it determines whether such methods as pessimistic and cross-validated cost-complexity pruning will increase or decrease predictive accuracy in decision tree induction. The effect of sparse data is illustrated first in an artificial domain and then in more realistic examples drawn from the UCI machine learning database repository.
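
The abstract's central claim, that overfitting avoidance is a bias whose effect depends on the domain, can be illustrated with a toy simulation. The sketch below is not the paper's experiment and does not implement pessimistic or cost-complexity pruning. It assumes five boolean attributes, treats a fully grown tree as one leaf per observed attribute combination, and treats pruning in its most extreme form, a single majority-class leaf. All function names and parameter values (`compare`, `fit_full`, `fit_pruned`, 10 training examples, 20% label noise) are hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict

def majority(labels):
    """Most common label in a non-empty list (ties go to the first one seen)."""
    return Counter(labels).most_common(1)[0][0]

def sample(n, k, target, noise, rng):
    """Draw n examples over k boolean attributes; flip each label with probability `noise`."""
    data = []
    for _ in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(k))
        y = target(x)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def fit_full(train):
    """Unpruned stand-in: one leaf per observed attribute combination.
    Unseen combinations fall back to the overall majority class."""
    root = majority([y for _, y in train])
    leaves = defaultdict(list)
    for x, y in train:
        leaves[x].append(y)
    table = {x: majority(ys) for x, ys in leaves.items()}
    return lambda x: table.get(x, root)

def fit_pruned(train):
    """Maximally pruned stand-in: a single leaf predicting the majority class."""
    root = majority([y for _, y in train])
    return lambda x: root

def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)

def compare(target, noise, n_train=10, k=5, trials=200, n_test=500, seed=0):
    """Mean test accuracy of (unpruned, pruned) models over repeated random trials."""
    rng = random.Random(seed)
    full = pruned = 0.0
    for _ in range(trials):
        train = sample(n_train, k, target, noise, rng)
        test = sample(n_test, k, target, noise, rng)
        full += accuracy(fit_full(train), test)
        pruned += accuracy(fit_pruned(train), test)
    return full / trials, pruned / trials

def parity(x):
    """Complex, noise-free relationship: class is the parity of the attributes."""
    return sum(x) % 2

def constant(x):
    """Trivial relationship: class is always 1 (noise is added by `sample`)."""
    return 1

if __name__ == "__main__":
    print("parity, no noise (unpruned, pruned):", compare(parity, noise=0.0))
    print("constant, 20% noise (unpruned, pruned):", compare(constant, noise=0.2))
```

With the same small training set size in both runs, the pruned model should come out behind on the parity domain, where every observed leaf carries real information, and ahead on the noisy constant domain, where the extra leaves mostly memorize noise. The direction of the effect comes from how well the pruning bias matches the domain, not from pruning itself.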