General and Efficient Multisplitting of Numerical Attributes

  • Authors:
  • Tapio Elomaa; Juho Rousu

  • Affiliations:
  • Department of Computer Science, P.O. Box 26, FIN-00014 University of Helsinki, Finland. elomaa@cs.helsinki.fi; VTT Biotechnology and Food Research, Tietotie 2, P.O. Box 1501, FIN-02044 VTT, Finland. Juho.Rousu@vtt.fi

  • Venue:
  • Machine Learning
  • Year:
  • 1999

Abstract

Often in supervised learning numerical attributes require special treatment and do not fit the learning scheme as well as one could hope. Nevertheless, they are common in practical tasks and, therefore, need to be taken into account. We characterize the well-behavedness of an evaluation function, a property that guarantees the optimal multi-partition of an arbitrary numerical domain to be defined on boundary points. Well-behavedness reduces the number of candidate cut points that need to be examined in multisplitting numerical attributes. Many commonly used attribute evaluation functions possess this property; we demonstrate that the cumulative functions Information Gain and Training Set Error as well as the non-cumulative functions Gain Ratio and Normalized Distance Measure are all well-behaved. We also devise a method of finding optimal multisplits efficiently by examining the minimum number of boundary point combinations that is required to produce partitions which are optimal with respect to a cumulative and well-behaved evaluation function. Our empirical experiments validate the utility of optimal multisplitting: it produces constantly better partitions than alternative approaches do and it only requires comparable time. In top-down induction of decision trees the choice of evaluation function has a more decisive effect on the result than the choice of partitioning strategy; optimizing the value of most common attribute evaluation functions does not raise the accuracy of the produced decision trees. In our tests the construction time using optimal multisplitting was, on the average, twice that required by greedy multisplitting, which in its part required on the average twice the time of binary splitting.
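
The two ideas in the abstract, restricting candidate cuts to boundary points and then searching for an optimal multisplit under a cumulative evaluation function, can be illustrated with a small sketch. This is not the authors' algorithm. It is a plain dynamic program over boundary-point blocks that uses Training Set Error as the evaluation function. It assumes that a cut between two adjacent attribute values can be skipped when the examples on both sides all belong to the same single class. The names `boundary_blocks` and `optimal_multisplit` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from itertools import groupby


def boundary_blocks(values, labels):
    """Group examples by attribute value, then merge adjacent groups that
    are pure and share one class. Cuts are only considered between the
    returned blocks. Each block is [first_value, last_value, class_counts]."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    blocks = []
    for value, group in groupby(pairs, key=lambda p: p[0]):
        counts = Counter(label for _, label in group)
        if blocks:
            prev = blocks[-1]
            same_pure_class = (len(counts) == 1 and len(prev[2]) == 1
                               and set(counts) == set(prev[2]))
            if same_pure_class:
                prev[1] = value              # extend the previous block
                prev[2] = prev[2] + counts
                continue
        blocks.append([value, value, counts])
    return blocks


def optimal_multisplit(values, labels, k):
    """Return (minimum training set error, sorted cut points) for a
    partition into min(k, number of blocks) intervals."""
    blocks = boundary_blocks(values, labels)
    n = len(blocks)
    if n == 0 or k < 1:
        raise ValueError("need at least one example and k >= 1")
    k = min(k, n)

    # Prefix sums of class counts make each interval's error cheap to compute.
    prefix = [Counter()]
    for _, _, counts in blocks:
        prefix.append(prefix[-1] + counts)

    def interval_error(i, j):
        # Examples in blocks i..j-1 that are not in the majority class.
        counts = prefix[j] - prefix[i]
        return sum(counts.values()) - max(counts.values())

    inf = float("inf")
    # best[m][j]: minimum error when the first j blocks form m intervals.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    for j in range(1, n + 1):
        best[1][j] = interval_error(0, j)
    for m in range(2, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                cost = best[m - 1][i] + interval_error(i, j)
                if cost < best[m][j]:
                    best[m][j] = cost
                    back[m][j] = i

    # Walk the back pointers; each cut is the midpoint between two blocks.
    cuts = []
    j = n
    for m in range(k, 1, -1):
        i = back[m][j]
        cuts.append((blocks[i - 1][1] + blocks[i][0]) / 2)
        j = i
    return best[k][n], sorted(cuts)
```

For example, `optimal_multisplit([1, 2, 3, 4, 5, 6], list("aabbaa"), 3)` considers only the two boundary points between the three class-uniform blocks instead of all five gaps between adjacent values. The sketch works because Training Set Error is cumulative: the error of a partition is the sum of the errors of its intervals, so the search can be done one interval at a time.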