An empirical investigation of the impact of discretization on common data distributions

  • Authors:
  • Michael K. Ismail;Vic Ciesielski

  • Affiliations:
  • Department of Computer Science, RMIT University, Melbourne, VIC 3001, AUSTRALIA;Department of Computer Science, RMIT University, Melbourne, VIC 3001, AUSTRALIA

  • Venue:
  • Design and application of hybrid intelligent systems
  • Year:
  • 2003

Quantified Score

Hi-index 0.01

Visualization

Abstract

This study attempts to identify the merits of six of the most popular discretization methods when confronted with a randomly generated dataset consisting of attributes that conform to one of eight common statistical distributions. It is hoped that the analysis will enlighten as to a heuristic which identifies the most appropriate discretization method to be applied, given some preliminary analysis or visualization to determine the type of statistical distribution of the attribute to be discretized. Further, the comparative effectiveness of discretization given each data distribution is a primary focus. Analysis of the data was accomplished by inducing a decision tree classifier (C4.5) on the discretized data and an error measure was used to determine the relative value of discretization. The experiments showed that the method of discretization and the level of inherent error placed in the class attribute has a major impact on classification errors generated post-discretization. More importantly, the general effectiveness of discretization varies significantly depending on the shape of data distribution considered. Distributions that are highly skewed or have high peaks tend to result in higher classification errors, and the relative superiority of supervised discretization over unsupervised discretization is diminished significantly when applied to these data distributions.