An empirical investigation of the impact of discretization on common data distributions

Authors:
Michael K. Ismail;Vic Ciesielski
Affiliations:
Department of Computer Science, RMIT University, Melbourne, VIC 3001, AUSTRALIA;Department of Computer Science, RMIT University, Melbourne, VIC 3001, AUSTRALIA
Venue:
Design and application of hybrid intelligent systems
Year:
2003

Abstract

This study attempts to identify the merits of six of the most popular discretization methods when applied to a randomly generated dataset whose attributes each conform to one of eight common statistical distributions. The analysis is intended to suggest a heuristic for choosing the most appropriate discretization method, given some preliminary analysis or visualization that determines the statistical distribution of the attribute to be discretized. The comparative effectiveness of discretization under each data distribution is also a primary focus. The data were analyzed by inducing a decision tree classifier (C4.5) on the discretized data, and an error measure was used to determine the relative value of discretization. The experiments showed that the method of discretization and the level of inherent error placed in the class attribute have a major impact on the classification errors generated after discretization. More importantly, the general effectiveness of discretization varies significantly with the shape of the data distribution considered. Distributions that are highly skewed or have high peaks tend to result in higher classification errors, and the relative superiority of supervised over unsupervised discretization diminishes significantly on these data distributions.
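To make the role of distribution shape concrete, the following is a minimal sketch rather than a reproduction of the study. The abstract does not name the six methods or the eight distributions, so the two unsupervised methods (equal-width and equal-frequency binning), the two distributions (uniform and exponential), the sample size, the number of bins, and all function names are assumptions chosen for illustration. The sketch covers only the discretization step and omits the C4.5 induction and the error measure.

```python
import random
from bisect import bisect_right
from collections import Counter


def equal_width_edges(values, k):
    """Return k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points taken at the i/k positions of the sorted sample."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(values, edges):
    """Map each value to the index of the bin it falls in (0 .. len(edges))."""
    return [bisect_right(edges, v) for v in values]


def bin_counts(values, edges):
    """Count how many values land in each bin, including empty bins."""
    counts = Counter(discretize(values, edges))
    return [counts.get(b, 0) for b in range(len(edges) + 1)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n, k = 1000, 5          # illustrative sample size and bin count (assumptions)
samples = {
    "uniform": [rng.uniform(0.0, 1.0) for _ in range(n)],
    "exponential (skewed)": [rng.expovariate(1.0) for _ in range(n)],
}

for name, values in samples.items():
    print(name)
    print("  equal-width counts:    ", bin_counts(values, equal_width_edges(values, k)))
    print("  equal-frequency counts:", bin_counts(values, equal_frequency_edges(values, k)))
```

On the skewed sample, equal-width binning places most values in the first bin and leaves the upper bins sparse, while equal-frequency binning keeps the bins balanced. This is one plausible way in which distribution shape could interact with the choice of discretization method, in the spirit of the abstract's findings.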