Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, categorical attributes take unordered values, so it is difficult to define a distance between pairs of values of the same attribute. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition is that the distance between two values of a categorical attribute A_i can be determined by how the values of the other attributes A_j are distributed among the dataset objects: if the A_j values are similarly distributed in the groups of objects corresponding to the two values of A_i, the distance between them is low. We also propose a solution to the critical problem of choosing the attributes A_j. We validate our approach on various real-world and synthetic datasets by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive with state-of-the-art categorical data clustering approaches.
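The intuition above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below makes several assumptions: the distance is taken to be the Euclidean distance between the conditional distributions P(A_j | A_i = x) and P(A_j | A_i = y), summed over the context attributes; the context attributes are passed in by the caller rather than selected automatically; and the names `conditional_distribution`, `context_distance`, and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def conditional_distribution(rows, target, context):
    """Estimate P(context value | target value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    dist = {}
    for t_value, counter in counts.items():
        total = sum(counter.values())
        dist[t_value] = {c: n / total for c, n in counter.items()}
    return dist


def context_distance(rows, target, context_attrs, x, y):
    """Distance between values x and y of `target`, based on how the
    context attributes are distributed among objects having x versus y.

    Assumption: Euclidean distance over conditional probabilities; the
    original method may use a different formula or normalization.
    """
    squared_sum = 0.0
    for ctx in context_attrs:
        dist = conditional_distribution(rows, target, ctx)
        px, py = dist.get(x, {}), dist.get(y, {})
        # Compare the two distributions over every observed context value.
        for value in set(px) | set(py):
            squared_sum += (px.get(value, 0.0) - py.get(value, 0.0)) ** 2
    return sqrt(squared_sum)


# Toy dataset (invented for illustration only).
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

d_close = context_distance(rows, "color", ["shape"], "red", "crimson")
d_far = context_distance(rows, "color", ["shape"], "red", "blue")
print(d_close, d_far)
```

In this toy data, "red" and "crimson" co-occur with the same shape distribution and so end up at distance zero, while "red" and "blue" co-occur with disjoint shapes and end up far apart. A pairwise matrix built this way could then be handed to a partitional or hierarchical clustering routine.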