Effect of data discretization on the classification accuracy in a high-dimensional framework

Authors:
Annika Tillander
Affiliations:
Department of Statistics, Stockholm University, Stockholm, Sweden
Venue:
International Journal of Intelligent Systems
Year:
2012

Abstract

We investigate the discretization of continuous variables for classification problems in a high-dimensional framework. Because the goal of classification is to correctly predict the class membership of an observation, we propose a discretization method that optimizes the discretization procedure using the misclassification probability as the measure of classification accuracy. Our method is compared with several other discretization methods, as well as with results for continuous data. To compare performance, we consider three supervised classification methods, and to capture the effect of high dimensionality, we investigate a number of feature variables for a fixed number of observations. Since discretization is a data transformation procedure, we also investigate how it affects the dependence structure. Our method performs well, and lower misclassification can be obtained in a high-dimensional framework for both simulated and real data if the continuous feature variables are first discretized. The dependence structure is well maintained for some discretization methods. © 2012 Wiley Periodicals, Inc.
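
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of choosing a discretization by misclassification, not the paper's actual procedure. It assumes a single continuous feature, binary class labels coded 0/1, a single cut point, and the empirical error rate as a stand-in for the misclassification probability. The function names `best_cut_point` and `discretize` are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_cut_point(x: Sequence[float], y: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, error_rate) for the single cut on one feature that
    minimizes the empirical misclassification rate.

    Assumption: labels are 0 or 1, and each of the two bins predicts its
    majority class. If no cut improves on a single bin, the threshold is inf.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")

    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_ones = sum(label for _, label in pairs)

    # Baseline: no cut, one bin predicting the overall majority class.
    best_threshold = float("inf")
    best_errors = min(total_ones, n - total_ones)

    left_ones = 0
    for i in range(1, n):
        left_ones += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        right_ones = total_ones - left_ones
        # Errors made by majority-class prediction within each bin.
        errors = min(left_ones, i - left_ones) + min(right_ones, (n - i) - right_ones)
        if errors < best_errors:
            best_errors = errors
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0

    return best_threshold, best_errors / n


def discretize(x: Sequence[float], threshold: float) -> List[int]:
    """Map each value to bin 0 (at or below the threshold) or bin 1 (above)."""
    return [0 if v <= threshold else 1 for v in x]


if __name__ == "__main__":
    feature = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
    labels = [0, 0, 0, 1, 1, 1]
    cut, err = best_cut_point(feature, labels)
    print(cut, err, discretize(feature, cut))
```

In a high-dimensional setting, a search of this kind would be run per feature before fitting a classifier. Whether the paper uses one cut or several, and how it estimates the misclassification probability, is not stated in the abstract.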