The integrative mining of heterogeneous data and the interpretability of data mining results are two of the most important challenges in today's data mining. It is commonly agreed in the community that, particularly in the research area of clustering, neither challenge has yet received the attention it deserves. Only a few approaches to clustering objects with mixed-type attributes exist, and those approaches do not consider cluster-specific dependencies between numerical and categorical attributes. Likewise, only a few clustering papers address the problem of interpretability: explaining why a certain set of objects has been grouped into a cluster and what distinguishes one cluster from another. In this paper, we approach both challenges by establishing a relationship to data compression via the Minimum Description Length principle: the more efficiently a detected cluster structure can be exploited for data compression, the better it is. Following this idea, we can learn, during the run of a clustering algorithm, the optimal trade-off among attribute weights and distinguish relevant attribute dependencies from coincidental ones. We extend the efficient Cholesky decomposition to model dependencies in heterogeneous data and to ensure interpretability. Our proposed algorithm, INCONCO, successfully finds clusters in mixed-type data sets, identifies the relevant attribute dependencies, and explains them using linear models and case-by-case analysis. In doing so, it outperforms existing approaches in effectiveness, as our extensive experimental evaluation demonstrates.
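To illustrate the core idea of scoring a cluster by its compression efficiency, the sketch below computes the description length (in bits) of a cluster's numerical attributes under a Gaussian model, using a Cholesky factorization of the covariance matrix for the log-determinant and Mahalanobis terms. This is only a minimal illustration of the MDL/Cholesky connection, not the INCONCO algorithm itself; the function name, the ridge regularization, and the restriction to numerical attributes are assumptions for the sake of a self-contained example.

```python
import numpy as np

def gaussian_coding_cost_bits(X, ridge=1e-6):
    """Coding cost (bits) of cluster points X under a Gaussian model.

    Illustrative sketch only (not the INCONCO implementation): the
    Cholesky factor L (cov = L @ L.T) gives both the log-determinant
    and the Mahalanobis distances without forming an explicit inverse.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = X.mean(axis=0)
    # Ridge term keeps the covariance positive definite (assumption).
    cov = np.cov(X, rowvar=False) + ridge * np.eye(d)
    L = np.linalg.cholesky(cov)
    # log det(cov) = 2 * sum(log diag(L))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    # Mahalanobis distances via a triangular solve: z = L^{-1} (x - mu)
    z = np.linalg.solve(L, (X - mu).T)      # shape (d, n)
    maha = np.sum(z * z, axis=0)            # squared distances per point
    # Negative log-likelihood in nats, converted to bits.
    nll_nats = 0.5 * np.sum(maha + d * np.log(2.0 * np.pi) + logdet)
    return nll_nats / np.log(2.0)
```

A tighter cluster yields a smaller coding cost, so candidate partitions can be compared by summing this cost over their clusters (plus the model cost for the parameters, which MDL also charges).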