A grouping method for categorical attributes having very large number of values

Authors:
Marc Boullé
Affiliations:
France Telecom R&D, Lannion, France
Venue:
MLDM'05 Proceedings of the 4th international conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2005

Abstract

In supervised machine learning, partitioning the values of a categorical attribute (also called grouping) aims to construct a new synthetic attribute that preserves the information of the initial attribute while reducing the number of its values. When the number of values is very large, the risk of overfitting the data increases sharply and building good groupings becomes difficult. In this paper, we propose two new grouping methods founded on a Bayesian approach, leading to Bayes optimal groupings. The first method exploits a standard schema for grouping models, and the second extends this schema by managing a “garbage” group dedicated to the least frequent values. Extensive comparative experiments demonstrate that the new grouping methods build high-quality groupings in terms of predictive quality, robustness, and small number of groups.
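
To make the idea of a “garbage” group concrete, the following is a minimal sketch, not the paper's method: it pools values by a fixed frequency threshold instead of the Bayesian criterion the abstract refers to. The function name `group_with_garbage`, the `min_count` threshold, and the `<garbage>` label are all assumptions introduced here for illustration.

```python
from collections import Counter


def group_with_garbage(values, min_count=5, garbage_label="<garbage>"):
    """Return a value -> group mapping for a categorical attribute.

    Hypothetical helper: values seen fewer than `min_count` times share
    one garbage group; every other value keeps a group of its own.
    A frequency threshold stands in for the paper's Bayesian criterion.
    """
    counts = Counter(values)
    return {
        value: (value if count >= min_count else garbage_label)
        for value, count in counts.items()
    }


# Toy attribute: two frequent values and three rare ones.
values = ["a"] * 6 + ["b"] * 5 + ["c", "d", "e"]
mapping = group_with_garbage(values, min_count=5)
grouped = [mapping[v] for v in values]
```

In this toy case the five original values collapse into three groups: `a`, `b`, and the shared garbage group. A method of the kind the abstract describes would also merge frequent values with each other and would choose the groups by optimizing a criterion, not by applying a hand-set cutoff.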