Data mining criteria for tree-based regression and classification

Authors: Andreas Buja; Yung-Seop Lee
Affiliations: AT&T Labs, Florham Park, NJ; Dongguk University, Korea
Venue: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Year: 2001

Abstract

This paper is concerned with the construction of regression and classification trees that are more adapted to data mining applications than conventional trees. To this end, we propose new splitting criteria for growing trees. Conventional splitting criteria attempt to perform well on both sides of a split by attempting a compromise in the quality of fit between the left and the right side. By contrast, we adopt a data mining point of view by proposing criteria that search for interesting subsets of the data, as opposed to modeling all of the data equally well. The new criteria do not split based on a compromise between the left and the right bucket; they effectively pick the more interesting bucket and ignore the other.

As expected, the result is often a simpler characterization of interesting subsets of the data. Less expected is that the new criteria often yield whole trees that provide more interpretable data descriptions. Surprisingly, it is a "flaw" that works to their advantage: The new criteria have an increased tendency to accept splits near the boundaries of the predictor ranges. This so-called "end-cut problem" leads to the repeated peeling of small layers of data and results in very unbalanced but highly expressive and interpretable trees.
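
The contrast between a compromise criterion and a one-sided criterion can be illustrated with a small regression sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names are hypothetical, the compromise score is taken to be the size-weighted variance of the two buckets, and the one-sided score is taken to be the larger of the two bucket means (treating "interesting" as "high response"). The toy data are invented.

```python
from statistics import mean, pvariance


def candidate_splits(xs, ys, min_bucket=1):
    """Yield (threshold, left_ys, right_ys) for each cut between distinct sorted x values."""
    pairs = sorted(zip(xs, ys))
    for i in range(min_bucket, len(pairs) - min_bucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        yield threshold, left, right


def compromise_score(left, right):
    """Assumed conventional criterion: size-weighted bucket variance (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * pvariance(left) + len(right) * pvariance(right)) / n


def one_sided_score(left, right):
    """Assumed one-sided criterion: only the bucket with the higher mean counts.

    Negated so that, as with compromise_score, lower is better.
    """
    return -max(mean(left), mean(right))


def best_split(xs, ys, score, min_bucket=1):
    """Return the threshold whose split minimizes the given score."""
    return min(candidate_splits(xs, ys, min_bucket), key=lambda s: score(s[1], s[2]))[0]


# Invented toy data: a step in the middle and one slightly higher value at the end.
xs = list(range(1, 11))
ys = [1, 1, 1, 1, 1, 3, 3, 3, 3, 4]

print(best_split(xs, ys, compromise_score))  # 5.5: balances fit on both sides
print(best_split(xs, ys, one_sided_score))   # 9.5: peels off the single highest point
```

On this toy data the compromise score cuts at the step in the middle, while the one-sided score cuts next to the boundary of the predictor range and isolates one observation, which is the end-cut behaviour the abstract describes. Applying the one-sided rule recursively to the remaining data would peel off further small layers.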