The integrative mining of heterogeneous data and the interpretability of data mining results are two of the most important challenges in today's data mining. It is commonly agreed in the community that, particularly in the research area of clustering, neither challenge has yet received the attention it deserves. Only a few approaches to clustering objects with mixed-type attributes exist, and those approaches do not consider cluster-specific dependencies between numerical and categorical attributes. Likewise, only a few clustering papers address the problem of interpretability: explaining why a certain set of objects has been grouped into a cluster and what distinguishes one cluster from another. In this paper, we approach both challenges by establishing a relationship to data compression via the Minimum Description Length principle: the more efficiently a detected cluster structure can be exploited for data compression, the better it is. Following this idea, we can learn, during the run of a clustering algorithm, the optimal trade-off among attribute weights and distinguish relevant attribute dependencies from coincidental ones. We extend the efficient Cholesky decomposition to model dependencies in heterogeneous data and to ensure interpretability. Our proposed algorithm, INCONCO, successfully finds clusters in mixed-type data sets, identifies the relevant attribute dependencies, and explains them using linear models and case-by-case analysis. In doing so, it outperforms existing approaches in effectiveness, as our extensive experimental evaluation demonstrates.
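To illustrate the core idea of scoring a cluster by its compression efficiency, the sketch below computes the description length (in bits) of a cluster's numerical attributes under a Gaussian model, using a Cholesky factorization of the covariance matrix for the log-determinant and Mahalanobis terms. This is only a minimal illustration of the MDL/Cholesky connection, not the INCONCO algorithm itself; the function name, the ridge regularization, and the restriction to numerical attributes are assumptions for the sake of a self-contained example.

```python
import numpy as np

def gaussian_coding_cost_bits(X, ridge=1e-6):
    """Coding cost (bits) of cluster points X under a Gaussian model.

    Illustrative sketch only (not the INCONCO implementation): the
    Cholesky factor L (cov = L @ L.T) gives both the log-determinant
    and the Mahalanobis distances without forming an explicit inverse.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = X.mean(axis=0)
    # Ridge term keeps the covariance positive definite (assumption).
    cov = np.cov(X, rowvar=False) + ridge * np.eye(d)
    L = np.linalg.cholesky(cov)
    # log det(cov) = 2 * sum(log diag(L))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    # Mahalanobis distances via a triangular solve: z = L^{-1} (x - mu)
    z = np.linalg.solve(L, (X - mu).T)      # shape (d, n)
    maha = np.sum(z * z, axis=0)            # squared distances per point
    # Negative log-likelihood in nats, converted to bits.
    nll_nats = 0.5 * np.sum(maha + d * np.log(2.0 * np.pi) + logdet)
    return nll_nats / np.log(2.0)
```

A tighter cluster yields a smaller coding cost, so candidate partitions can be compared by summing this cost over their clusters (plus the model cost for the parameters, which MDL also charges).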