A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems

Authors:
Daniele Micci-Barreca
Affiliations:
ClearCommerce Corporation, Austin, TX
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2001

Abstract

Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes.

This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) and is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, such as ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation.

While the statistical methods discussed in this paper were first introduced in the mid-1950s, their use as a preprocessing step for complex models, such as neural networks, has not previously been discussed in the literature.
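
The abstract describes the scheme only in words, so the following is a minimal sketch of the general idea rather than the paper's exact procedure. It replaces each category value with a blend of that category's own target mean and the global target mean, so that rare categories fall back toward the global estimate. The blending weight `n / (n + m)`, the smoothing parameter `m`, and the function names `fit_shrunken_means` and `transform` are assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def fit_shrunken_means(categories, targets, m=5.0):
    """Map each category to a blend of its own target mean and the global mean.

    Assumed shrinkage form: weight = n / (n + m), where n is the number of
    training rows in the category. Small n keeps the estimate near the
    global mean; large n lets the category's own mean dominate.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("need at least one training row")

    prior = sum(targets) / len(targets)  # global target mean

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1

    encoding = {}
    for category, n in counts.items():
        weight = n / (n + m)
        category_mean = sums[category] / n
        encoding[category] = weight * category_mean + (1.0 - weight) * prior
    return encoding, prior


def transform(categories, encoding, prior):
    """Replace each category with its estimate; unseen values get the prior."""
    return [encoding.get(category, prior) for category in categories]


# Toy data: ZIP codes as the high-cardinality attribute, binary target.
zip_codes = ["78701", "78701", "78701", "78701", "10001"]
labels = [1, 1, 1, 0, 1]

encoding, prior = fit_shrunken_means(zip_codes, labels, m=1.0)
features = transform(["78701", "10001", "94105"], encoding, prior)
```

Here the single-row ZIP code is pulled halfway toward the global mean, while the four-row one stays close to its own mean. For a hierarchical attribute, the same blend could in principle be applied with the parent level's estimate (for example, a ZIP-code prefix) in place of the global mean, which is one way to read the abstract's "blending statistics at the various levels of aggregation".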