Using Self-Similarity to Cluster Large Data Sets

Authors:
Daniel Barbará; Ping Chen
Affiliations:
ISE Department, MSN 4A4, George Mason University, Fairfax, Virginia, 22030, USA. dbarbara@gmu.edu; Computer and Mathematical Science Department, University of Houston-Downtown, One Main Street, Houston, TX 77002, USA. Chenp@zeus.dt.uh.edu
Venue:
Data Mining and Knowledge Discovery
Year:
2003


Abstract

Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms scale poorly with the size of the data set and the number of dimensions that describe the points, fail to find clusters of arbitrary shape, or do not deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
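
The placement rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a box-counting estimate of the fractal dimension (the abstract does not say which estimator FC uses), it assumes the initial clusters are already given, and it recomputes the dimension from scratch for every candidate cluster instead of maintaining cell counts incrementally. The names `box_counting_dimension`, `assign_point`, and the default `scales` are hypothetical.

```python
import numpy as np

def box_counting_dimension(points, scales=(1/2, 1/4, 1/8, 1/16)):
    """Estimate the box-counting dimension of a point set.

    Assumption: the dimension is taken as the least-squares slope of
    log N(r) against log(1/r), where N(r) is the number of grid cells
    of side r that contain at least one point.
    """
    points = np.asarray(points, dtype=float)
    counts = []
    for r in scales:
        # Map every point to the integer index of the grid cell holding it.
        cells = np.floor(points / r).astype(int)
        counts.append(len(np.unique(cells, axis=0)))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

def assign_point(clusters, point):
    """Add `point` to the cluster whose dimension changes least.

    `clusters` is a list of (n_i, d) arrays and is updated in place.
    Returns the index of the chosen cluster.
    """
    changes = []
    for members in clusters:
        before = box_counting_dimension(members)
        after = box_counting_dimension(np.vstack([members, point]))
        changes.append(abs(after - before))
    best = int(np.argmin(changes))
    clusters[best] = np.vstack([clusters[best], point])
    return best
```

For example, with one cluster of points along a line segment and another filling a square, a new point lying on the segment leaves the segment's occupied-cell counts unchanged but adds an isolated cell to the square at every scale, so `assign_point` places it with the segment.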