Optimal bin number for equal frequency discretizations in supervized learning

Authors:
Marc Boulle
Affiliations:
France Telecom R&D, 2, Avenue Pierre Marzin, 22300 Lannion, France. E-mail: marc.boulle@francetelecom.com
Venue:
Intelligent Data Analysis
Year:
2005

Abstract

While real data often comes in a mixed format, both discrete and continuous, many supervised induction algorithms require discrete data. Although efficient supervised discretization methods are available, the unsupervised Equal Frequency discretization method is still widely used by statisticians, both for data exploration and for data preparation. In this paper, we propose an automatic method, based on a Bayesian approach, to optimize the number of bins for Equal Frequency discretizations in the context of supervised learning. We introduce a space of Equal Frequency discretization models and a prior distribution defined on this model space. This results in the definition of a Bayes optimal evaluation criterion for Equal Frequency discretizations. We then propose an optimal search algorithm whose run-time is super-linear in the sample size. Extensive comparative experiments demonstrate that the method works quite well in many cases.
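
To make the idea in the abstract concrete, the following Python sketch builds Equal Frequency bins and picks the bin count that minimizes a Bayesian-style cost. The abstract does not state the criterion, so the cost used here is an assumption for illustration only: a uniform prior on the bin count, a uniform prior on the class counts within each bin, and a multinomial likelihood term. The function names (`equal_frequency_bins`, `cost`, `best_bin_count`) are hypothetical, tied values are not handled specially, and the search is a brute-force scan over all bin counts, not the paper's search algorithm.

```python
from math import lgamma, log


def equal_frequency_bins(values, k):
    """Return k lists of sample indices, sorted by value, of near-equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    return [order[j * n // k:(j + 1) * n // k] for j in range(k)]


def log_binomial(n, r):
    """Natural log of the binomial coefficient C(n, r)."""
    return lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)


def cost(values, labels, k):
    """Assumed cost (not taken from the source): negative log of prior times likelihood."""
    classes = sorted(set(labels))
    num_classes = len(classes)
    total = log(len(values))  # uniform prior over bin counts 1..n
    for bin_indices in equal_frequency_bins(values, k):
        n_i = len(bin_indices)
        # Uniform prior over the possible class-count vectors in this bin.
        total += log_binomial(n_i + num_classes - 1, num_classes - 1)
        # Multinomial term: log(n_i! / prod_c n_ic!).
        total += lgamma(n_i + 1)
        for c in classes:
            n_ic = sum(1 for i in bin_indices if labels[i] == c)
            total -= lgamma(n_ic + 1)
    return total


def best_bin_count(values, labels, max_bins=None):
    """Brute-force scan: return the bin count with the lowest assumed cost."""
    max_bins = max_bins or len(values)
    return min(range(1, max_bins + 1), key=lambda k: cost(values, labels, k))


if __name__ == "__main__":
    # Toy data: the class changes exactly once, halfway through the value range.
    values = list(range(20))
    labels = ["a"] * 10 + ["b"] * 10
    print(best_bin_count(values, labels))
```

On this toy data, two bins separate the classes perfectly, so any cost that penalizes both impure bins and extra bins should favor a small bin count. Each call to `cost` re-sorts the data, which keeps the sketch short at the expense of speed.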