The minimum code length for clustering using the gray code

Authors:
Mahito Sugiyama;Akihiro Yamamoto
Affiliations:
Graduate School of Informatics, Kyoto University, Kyoto, Japan and The Japan Society for the Promotion of Science;Graduate School of Informatics, Kyoto University, Kyoto, Japan
Venue:
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part III
Year:
2011

Abstract

We propose new approaches that exploit compression algorithms for clustering numerical data. Our first contribution is a measure, the Minimum Code Length (MCL), that scores the quality of a given clustering result under a fixed encoding scheme. Our second contribution is a general strategy for translating any encoding method into a clustering algorithm, which we call COOL (COding-Oriented cLustering). COOL has a low computational cost, since it scales linearly with the data set size, and its clustering results are shown to minimize the MCL. To further illustrate this approach, we take the Gray code as the encoding scheme and present G-COOL. G-COOL can find clusters of arbitrary shapes and remove noise. Moreover, it is robust to changes in the input parameters: it requires only two lower bounds, one for the number of clusters and one for the size of each cluster, whereas most algorithms for finding arbitrarily shaped clusters work well only if all parameters are tuned appropriately. G-COOL is theoretically shown to achieve internal cohesion and external isolation, and is experimentally shown to work well on both synthetic and real data sets.
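As a rough illustration of why a Gray code is an attractive encoding for nearby numerical values, the sketch below discretizes a value in [0, 1) into equal-width cells and encodes the cell index with the standard binary-reflected Gray code. It is not the paper's G-COOL algorithm or its exact Gray code embedding of real numbers. The function names (`to_gray`, `from_gray`, `gray_code_of`, `binary_code_of`, `hamming`) and the equal-width discretization are assumptions made only for this example.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Inverse of to_gray: recover the integer from its Gray code."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def cell_index(x: float, level: int) -> int:
    """Index of the equal-width cell of [0, 1) containing x (2**level cells).

    The equal-width grid is an assumption of this sketch.
    """
    if not 0.0 <= x < 1.0:
        raise ValueError("x must lie in [0, 1)")
    return int(x * (1 << level))


def gray_code_of(x: float, level: int) -> str:
    """Gray code string of length `level` for the cell containing x."""
    return format(to_gray(cell_index(x, level)), f"0{level}b")


def binary_code_of(x: float, level: int) -> str:
    """Plain binary string of length `level` for the cell containing x."""
    return format(cell_index(x, level), f"0{level}b")


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


if __name__ == "__main__":
    # 0.49 and 0.5 fall into adjacent cells at level 3.
    for x in (0.49, 0.5):
        print(x, binary_code_of(x, 3), gray_code_of(x, 3))
```

In this sketch, adjacent cells always receive Gray codes that differ in exactly one bit, whereas their plain binary codes can differ in every bit (for example, cells 3 and 4 at level 3). That property is the usual motivation for preferring a Gray code when closeness of values should be reflected in closeness of codes.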