Context-Based Distance Learning for Categorical Data Clustering

Authors:
Dino Ienco;Ruggero G. Pensa;Rosa Meo
Affiliations:
Dept. of Computer Science, University of Torino, Italy;Dept. of Computer Science, University of Torino, Italy;Dept. of Computer Science, University of Torino, Italy
Venue:
IDA '09 Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII
Year:
2009

Citing 12
Cited 0

C4.5: programs for machine learning

C4.5: programs for machine learning
CACTUS—clustering categorical data using summaries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Data mining: concepts and techniques

Data mining: concepts and techniques
COOLCAT: an entropy-based algorithm for categorical clustering

Proceedings of the eleventh international conference on Information and knowledge management
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

Data Mining and Knowledge Discovery
ROCK: A Robust Clustering Algorithm for Categorical Attributes

ICDE '99 Proceedings of the 15th International Conference on Data Engineering
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions

The Journal of Machine Learning Research
An introduction to variable and feature selection

The Journal of Machine Learning Research
Entropy-based criterion in categorical clustering

ICML '04 Proceedings of the twenty-first international conference on Machine learning
CLICKS: Mining Subspace Clusters in Categorical Data via K-Partite Maximal Cliques

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set

Pattern Recognition Letters

Quantified Score

Hi-index	0.00

Visualization

Abstract

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A i can be determined by the way in which the values of the other attributes A j are distributed in the dataset objects: if they are similarly distributed in the groups of objects in correspondence of the distinct values of A i a low value of distance is obtained. We propose also a solution to the critical point of the choice of the attributes A j . We validate our approach on various real world and synthetic datasets, by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive w.r.t. categorical data clustering approaches in the state of the art.