An Exact Probability Metric for Decision Tree Splitting and Stopping

  • Authors:
  • J. Kent Martin

  • Affiliations:
  • Department of Information and Computer Science, University of California, Irvine, Irvine, CA 92692. E-mail: jmartin@ics.uci.edu

  • Venue:
  • Machine Learning
  • Year:
  • 1997


Abstract

ID3's information gain heuristic is well-known to be biased towards multi-valued attributes. This bias is only partially compensated for by C4.5's gain ratio. Several alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). All of these metrics are biased towards splits with smaller branches, where low-entropy splits are likely to occur by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis that the class distribution is independent of the split. Both gain and the chi-squared tests arise in asymptotic approximations to the hypergeometric, with similar criteria for their admissibility. Previous failures of pre-pruning are traced in large part to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. The choice of split-selection metric typically has little effect on accuracy, but can profoundly affect complexity and the effectiveness and efficiency of pruning. Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
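The exact null probability the abstract refers to is the multiple hypergeometric probability of a class-by-branch contingency table with fixed margins (the generalization underlying Fisher's exact test). As a minimal sketch, not the paper's own code, it can be computed with exact rational arithmetic using only the Python standard library; the function name and table layout here are illustrative assumptions:

```python
# Sketch: multiple hypergeometric probability of a contingency table
# with fixed row and column sums, under the null hypothesis that the
# class distribution is independent of the split.
# Rows = branches of a candidate split, columns = classes.
from fractions import Fraction
from math import factorial

def hypergeometric_prob(table):
    """Exact P(table | fixed margins) under independence:
    (prod of row-sum factorials * prod of column-sum factorials)
    / (N! * prod of cell factorials)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    numerator = 1
    for s in row_sums + col_sums:
        numerator *= factorial(s)
    denominator = factorial(n)
    for row in table:
        for cell in row:
            denominator *= factorial(cell)
    return Fraction(numerator, denominator)

# A perfectly pure 2-way split of 4 examples is far less probable under
# the null than a completely mixed one, so it is stronger evidence
# against independence:
pure = hypergeometric_prob([[2, 0], [0, 2]])    # -> 1/6
mixed = hypergeometric_prob([[1, 1], [1, 1]])   # -> 2/3
```

Low values of this probability indicate class distributions unlikely to arise by chance, which is what makes it usable both as a split-selection metric and as a pre-pruning stopping criterion.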