Finding Dense Clusters in Hyperspace: An Approach Based on Row Shuffling

Authors: Daniel Barbará; Xintao Wu
Venue: WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
Year: 2001

Abstract

High-dimensional data sets generally exhibit low density, since the number of possible cells far exceeds the number of cells actually populated in the set. This characteristic has prompted researchers to automate the search for subspaces where the density is higher. In this paper we present an algorithm that takes advantage of categorical, unordered dimensions to increase the density of subspaces in the data set. It does this by shuffling rows in those dimensions so that the final ordering increases the density of regions in hyperspace. We argue for using this shuffling technique as a preprocessing step for techniques that compress the hyperspace by means of statistical models, since denser regions usually yield better-fitting models, and our experimental results support this argument. We also show how to integrate the algorithm with two grid clustering procedures in order to find these dense regions. Experiments on both synthetic and real data sets show that row shuffling can drastically increase the density of the subspaces, leading to better clusters.
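To make the idea concrete, the following is a minimal sketch of why reordering the values of an unordered categorical dimension can raise the density of contiguous regions. It is not the algorithm from the paper: the toy occupancy grid, the helper names `best_block_density` and `shuffle_rows`, and the ordering rule (sorting rows by their occupancy pattern) are all assumptions chosen for illustration.

```python
from itertools import product


def best_block_density(grid, h, w):
    """Return the highest fraction of occupied cells over all
    contiguous h-by-w blocks of a 0/1 occupancy grid."""
    rows, cols = len(grid), len(grid[0])
    best = 0.0
    for r, c in product(range(rows - h + 1), range(cols - w + 1)):
        filled = sum(grid[r + i][c + j] for i in range(h) for j in range(w))
        best = max(best, filled / (h * w))
    return best


def shuffle_rows(grid):
    """Reorder rows so that rows with identical occupancy patterns
    become adjacent. Assumption: a plain sort stands in for whatever
    ordering criterion a real row-shuffling method would use."""
    return sorted(grid, reverse=True)


# Hypothetical data: each row is one value of an unordered categorical
# dimension, each column one value of a second dimension; 1 = occupied.
grid = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

before = best_block_density(grid, 2, 2)
after = best_block_density(shuffle_rows(grid), 2, 2)
print(before, after)
```

Because the row dimension is unordered, permuting it loses no information. In this toy grid the densest 2-by-2 block goes from half full to completely full once similar rows sit next to each other, which is the kind of effect a grid clustering step could then exploit.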