Robust information-theoretic clustering

Authors:
Christian Böhm;Christos Faloutsos;Jia-Yu Pan;Claudia Plant
Affiliations:
University of Munich, Munich, Germany;CMU, Pittsburgh, PA;CMU, Pittsburgh, PA;UMIT, Hall, Austria
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

References (14)

BIRCH: an efficient data clustering method for very large databases
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data

CURE: an efficient clustering algorithm for large databases
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Automatic subspace clustering of high dimensional data for data mining applications
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

OPTICS: ordering points to identify the clustering structure
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data

Finding generalized projected clusters in high dimensional spaces
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data

Document clustering using word clusters via the information bottleneck method
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

Information Retrieval

Clustering Algorithms

X-means: Extending K-means with Efficient Estimation of the Number of Clusters
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

Efficient and Effective Clustering Methods for Spatial Data Mining
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases

Computing Clusters of Correlation Connected objects
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Fully automatic cross-associations
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

CURLER: finding and visualizing nonlinear correlation clusters
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

ViVo: Visual Vocabulary Construction for Mining Biomedical Images
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining

Cited by (11)

Outlier-robust clustering using independent components
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Summarizing spatial data streams using ClusterHulls
Journal of Experimental Algorithmics (JEA)

Data weaving: scaling up the state-of-the-art in data clustering
Proceedings of the 17th ACM conference on Information and knowledge management

CoCo: coding cost for parameter-free outlier detection
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Identifying the components
Data Mining and Knowledge Discovery

Entropy-based motion segmentation from a moving platform
IROS'09 Proceedings of the 2009 IEEE/RSJ international conference on Intelligent robots and systems

Clustering by synchronization
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

ITCH: information-theoretic cluster hierarchies
ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part I

Genetic algorithm for finding cluster hierarchies
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I

Integrative parameter-free clustering of data with mixed type attributes
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I

Measuring non-gaussianity by phi-transformed and fuzzy histograms
Advances in Artificial Neural Systems - Special issue on Advances in Unsupervised Learning Techniques Applied to Biosciences and Medicine

Abstract

How do we find a natural clustering of a real-world point set that contains an unknown number of clusters with different shapes and may be contaminated by noise? Most clustering algorithms are designed around specific assumptions (such as Gaussianity), often require the user to supply input parameters, and are sensitive to noise. In this paper, we propose a robust framework for determining a natural clustering of a given data set, based on the minimum description length (MDL) principle. The proposed framework, Robust Information-theoretic Clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies the clusters of noise and adjusts the clustering so that it simultaneously determines the most natural number and shape (subspace) of the clusters. RIC can be combined with any clustering technique, ranging from K-means and K-medoids to advanced methods such as spectral clustering. In fact, RIC can purify and improve an initial coarse clustering even when we start with very simple methods such as grid-based space partitioning. Moreover, RIC scales well with the data set size. Extensive experiments on synthetic and real-world data sets validate the proposed RIC framework.
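
To make the MDL idea in the abstract concrete, the following is a minimal sketch of how two candidate clusterings of the same points might be compared by total coding cost in bits. It is not the coding scheme from the paper: the one-dimensional Gaussian model, the two-parameter cost term, and the names `gaussian_bits` and `description_length` are all assumptions chosen for illustration.

```python
import math


def gaussian_bits(points):
    """Bits to encode points under a Gaussian fitted to them.

    Assumption: 1-D data and a Gaussian model. The value is a negative
    log2-density, so it is only meaningful up to a constant that depends
    on the quantization resolution (the same constant for every clustering
    of the same points).
    """
    m = len(points)
    mu = sum(points) / m
    var = max(sum((x - mu) ** 2 for x in points) / m, 1e-9)  # floor avoids log(0)
    return sum(
        0.5 * math.log2(2 * math.pi * var)
        + (x - mu) ** 2 / (2 * var) * math.log2(math.e)
        for x in points
    )


def description_length(clusters):
    """Total bits for a clustering: data cost + cluster-ID cost + parameter cost.

    Assumption: every cluster is non-empty and stores two parameters
    (mean and variance), each charged 0.5 * log2(n) bits.
    """
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        id_bits = -len(c) * math.log2(len(c) / n)  # which cluster each point belongs to
        param_bits = 2 * 0.5 * math.log2(n)        # mean and variance
        total += gaussian_bits(c) + id_bits + param_bits
    return total


# Two well-separated groups of hypothetical points.
a = [-1.0, -0.5, 0.0, 0.5, 1.0]
b = [9.0, 9.5, 10.0, 10.5, 11.0]

split_cost = description_length([a, b])    # two clusters
merged_cost = description_length([a + b])  # one cluster holding all points

print("split is cheaper:", split_cost < merged_cost)
```

Under these assumptions, the clustering with the smaller description length is preferred. For the points above, keeping the two groups separate costs fewer bits than merging them, because a single wide Gaussian encodes the points poorly.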