Partitioning Nominal Attributes in Decision Trees

  • Authors:
  • Don Coppersmith; Se June Hong; Jonathan R. M. Hosking

  • Affiliations:
  • IBM Research Division, T.J. Watson Research Center, Yorktown Heights, NY 10598, USA (copper@watson.ibm.com; hong@watson.ibm.com; hosking@watson.ibm.com)

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 1999

Abstract

To find the optimal branching of a nominal attribute at a node in an L-ary decision tree, one is often forced to search over all possible L-ary partitions for the one that yields the minimum impurity measure. For binary trees (L = 2) when there are just two classes, a short-cut search is possible that is linear in n, the number of distinct values of the attribute. For the general case in which the number of classes, k, may be greater than two, Burshtein et al. have shown that the optimal partition satisfies a condition that involves the existence of $\binom{L}{2}$ hyperplanes in the class probability space. We derive a property of the optimal partition for concave impurity measures (including in particular the Gini and entropy impurity measures) in terms of the existence of L vectors in the dual of the class probability space, which implies the earlier condition. Unfortunately, these insights still do not offer a practical search method when n and k are large, even for binary trees. We therefore present a new heuristic search algorithm to find a good partition. It is based on ordering the attribute's values according to their principal component scores in the class probability space, and is linear in n. We demonstrate the effectiveness of the new method through Monte Carlo simulation experiments and compare its performance against other heuristic methods.
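As a concrete illustration of the heuristic described in the abstract, the sketch below orders a nominal attribute's values by their first principal component scores in the class probability space and then scans the n-1 contiguous cut points along that ordering. The function name pca_order_partition, the counts-matrix interface, and the choice of the Gini measure for the scan are assumptions made for illustration; they are not the authors' exact implementation, which may differ in details such as weighting and tie-breaking.

```python
import numpy as np

def pca_order_partition(counts):
    """Heuristic binary split of a nominal attribute (illustrative sketch).

    counts: (n, k) array; counts[i, j] = number of training examples
    with attribute value i and class j. Assumes every value occurs at
    least once. Returns the value indices assigned to the left branch.
    """
    counts = np.asarray(counts, dtype=float)
    weights = counts.sum(axis=1)           # examples observed per value
    probs = counts / weights[:, None]      # class probability vector per value

    # Weighted first principal component of the class probability vectors.
    mean = np.average(probs, axis=0, weights=weights)
    centered = probs - mean
    cov = (weights[:, None] * centered).T @ centered / weights.sum()
    pc1 = np.linalg.eigh(cov)[1][:, -1]    # eigenvector of largest eigenvalue

    # Order the attribute values by their principal component scores.
    order = np.argsort(probs @ pc1)

    def weighted_gini(c):
        """Gini impurity of a class-count vector, weighted by its size."""
        total = c.sum()
        p = c / total
        return total * (1.0 - np.dot(p, p))

    # Scan the n-1 contiguous cut points along the ordering, maintaining
    # running class totals so the whole scan is linear in n.
    left = np.zeros(counts.shape[1])
    right = counts.sum(axis=0)
    best_cut, best_impurity = 1, np.inf
    for cut in range(1, len(order)):
        moved = counts[order[cut - 1]]
        left += moved
        right -= moved
        impurity = weighted_gini(left) + weighted_gini(right)
        if impurity < best_impurity:
            best_cut, best_impurity = cut, impurity
    return order[:best_cut]
```

For example, with three attribute values whose class counts are [[9, 1], [5, 5], [1, 9]], the principal component ordering places the values along the natural gradient from class 0 to class 1, and the scan then only needs to consider two cut points rather than all subsets.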