Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data

Authors:
Eugenio Cesario; Giuseppe Manco; Riccardo Ortale
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007

Abstract

A parameter-free, fully automatic approach to clustering high-dimensional categorical data is proposed. The technique is a two-phase iterative procedure that attempts to improve the overall quality of the whole partition. In the first phase, cluster assignments are taken as given, and a new cluster is added to the partition by choosing and splitting a low-quality cluster. In the second phase, the number of clusters is held fixed, and cluster assignments are optimized. In this way, the algorithm finds clusters whose number is established by the inherent features of the underlying dataset rather than specified in advance. Furthermore, the approach is parametric with respect to the notion of cluster quality: here, a cluster is defined as a set of tuples exhibiting a form of homogeneity. We show how a suitable notion of cluster homogeneity can be defined for high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows. Experiments on both synthetic and real data show that the resulting algorithm scales linearly and achieves nearly optimal results in terms of compactness and separation.
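
The two-phase loop described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state the homogeneity measure, so the code below assumes a category-utility-style score (the gain in size-weighted squared item frequencies over the unclustered data, averaged per cluster). The function names `homogeneity`, `partition_score`, `stabilize`, and `cluster` are hypothetical, and the naive reassignment loop makes no attempt at the linear scaling the paper reports.

```python
from collections import Counter

def homogeneity(cluster, n):
    """Assumed quality measure: size-weighted sum of squared item frequencies."""
    if not cluster:
        return 0.0
    counts = Counter(item for row in cluster for item in row)
    return sum(c * c for c in counts.values()) / (len(cluster) * n)

def total_homogeneity(clusters, n):
    return sum(homogeneity(c, n) for c in clusters)

def partition_score(clusters, rows):
    """Category-utility-style score: gain over unclustered data, per cluster."""
    n = len(rows)
    gain = total_homogeneity(clusters, n) - homogeneity(rows, n)
    return gain / len(clusters)

def stabilize(clusters, n, eps=1e-12):
    """Move single rows between clusters while total homogeneity strictly improves."""
    clusters = [list(c) for c in clusters]
    moved = True
    while moved:
        moved = False
        for src in range(len(clusters)):
            for row in list(clusters[src]):
                best, best_h = src, total_homogeneity(clusters, n)
                clusters[src].remove(row)
                for dst in range(len(clusters)):
                    if dst == src:
                        continue
                    clusters[dst].append(row)          # tentative move
                    h = total_homogeneity(clusters, n)
                    clusters[dst].pop()                # undo it
                    if h > best_h + eps:
                        best, best_h = dst, h
                clusters[best].append(row)
                moved = moved or best != src
    return [c for c in clusters if c]                  # drop clusters left empty

def cluster(rows, eps=1e-12):
    """Split a low-quality cluster, reassign rows, stop when no split helps."""
    n = len(rows)
    clusters = [list(rows)]
    while True:
        base = partition_score(clusters, rows)
        # Try the least internally homogeneous clusters first (an assumption).
        order = sorted(range(len(clusters)),
                       key=lambda i: homogeneity(clusters[i], len(clusters[i])))
        for i in order:
            parts = stabilize([clusters[i], []], n)    # phase 1: split one cluster
            if len(parts) < 2:
                continue                               # nothing to split off
            others = clusters[:i] + clusters[i + 1:]
            candidate = stabilize(others + parts, n)   # phase 2: fixed cluster count
            if partition_score(candidate, rows) > base + eps:
                clusters = candidate
                break
        else:
            return clusters                            # no split improved the score

# Toy data: each row is a set of categorical items.
rows = [frozenset(r) for r in (
    {"a", "b", "c"}, {"x", "y", "z"}, {"a", "b", "c"},
    {"x", "y", "w"}, {"a", "b", "d"}, {"x", "y", "z"},
)]
result = cluster(rows)
print(len(result), [sorted("".join(sorted(r)) for r in c) for c in result])
```

On this toy input the sketch separates the rows built from a, b, c, d from those built from w, x, y, z and then stops, because splitting either group further lowers the assumed per-cluster score. The number of clusters is never passed in, which is the property the abstract emphasizes; a different quality measure could be swapped in by replacing `homogeneity` and `partition_score`.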