The effect of small disjuncts and class distribution on decision tree learning

  • Authors:
  • Gary Mitchell Weiss; Haym Hirsh

  • Venue:
  • Ph.D. thesis
  • Year:
  • 2003

Abstract

The main goal of classifier learning is to generate a model that makes few misclassification errors. Given this emphasis on error minimization, it makes sense to try to understand how the induction process gives rise to classifiers that make errors, and whether we can identify the parts of a classifier that generate most of those errors. This thesis provides the first comprehensive studies of two major sources of classification errors. The first study concerns small disjuncts: the disjuncts within a classifier that cover only a few training examples. An analysis of classifiers induced from thirty data sets shows that these small disjuncts are extremely error prone and often account for the majority of all classification errors. Because small disjuncts largely determine classifier performance, we use them as a “lens” through which to study classifier induction. Factors such as pruning, training-set size, noise, and class imbalance are each analyzed to determine how they affect small disjuncts and, more generally, classifier learning. The second study analyzes the effect that rare classes and class distribution have on learning. Examples belonging to rare classes are shown to be misclassified much more often than examples of common classes. The thesis then analyzes the impact that varying the class distribution of the training data has on classifier performance. The experimental results indicate that the naturally occurring class distribution is not always best for learning, and that a balanced class distribution should be chosen when the goal is a classifier robust to differing misclassification costs. Because of the costs associated with obtaining and learning from training data, it is often necessary to limit the amount of data used for learning. This thesis presents a budget-sensitive progressive-sampling algorithm for selecting training examples in that situation, and shows that the algorithm produces a class distribution that performs quite well for learning (i.e., is near optimal).
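The small-disjunct finding can be illustrated with a minimal sketch. The numbers below are made up for illustration and are not the thesis's data; the idea is simply to measure what fraction of all classification errors is contributed by disjuncts whose coverage falls at or below a chosen size cutoff.

```python
def error_concentration(disjuncts, size_cutoff):
    """Fraction of all classification errors contributed by 'small'
    disjuncts, i.e. those covering <= size_cutoff training examples.
    disjuncts: list of (coverage, errors) pairs, one per disjunct."""
    total_errors = sum(e for _, e in disjuncts)
    small_errors = sum(e for c, e in disjuncts if c <= size_cutoff)
    return small_errors / total_errors if total_errors else 0.0

# Illustrative (made-up) disjuncts: a few large, accurate ones and
# many small, error-prone ones, mimicking the pattern the thesis reports.
disjuncts = [(120, 3), (85, 2), (40, 2),               # large disjuncts
             (5, 3), (4, 2), (3, 2), (2, 2), (1, 1)]   # small disjuncts

print(error_concentration(disjuncts, size_cutoff=5))   # ~0.59: most errors
```

Here the small disjuncts cover only 15 of the 270 training examples yet account for well over half the errors, which is the kind of error concentration the thesis quantifies across thirty data sets.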
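The progressive-sampling component can be sketched as follows. This is a generic geometric sampling schedule capped by an example budget, a common building block of progressive sampling; the start size and growth factor are assumptions for illustration, and the thesis's budget-sensitive algorithm additionally chooses the class distribution of each sample, which is omitted here.

```python
def progressive_sample_sizes(budget, start=100, factor=2):
    """Geometric sampling schedule capped by a total example budget:
    acquire start, factor*start, factor^2*start, ... examples per stage,
    stopping before the cumulative total would exceed the budget."""
    sizes, total, n = [], 0, start
    while total + n <= budget:
        sizes.append(n)
        total += n
        n *= factor

    return sizes

print(progressive_sample_sizes(budget=1000))  # [100, 200, 400]
```

At each stage a budget-sensitive learner would train on the sample acquired so far and stop early once additional data no longer improves the classifier enough to justify its cost.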