Finding Dense Clusters in Hyperspace: An Approach Based on Row Shuffling

  • Authors:
  • Daniel Barbará; Xintao Wu

  • Venue:
  • WAIM '01 Proceedings of the Second International Conference on Advances in Web-Age Information Management
  • Year:
  • 2001

Abstract

High-dimensional data sets generally exhibit low density, since the number of possible cells far exceeds the number of cells actually occupied by the data. This characteristic has prompted researchers to automate the search for subspaces where the density is higher. In this paper we present an algorithm that takes advantage of categorical, unordered dimensions to increase the density of subspaces in the data set. It does this by shuffling rows in those dimensions, so that the final ordering yields denser regions in hyperspace. We argue for using this shuffling technique as a preprocessing step for other techniques that compress the hyperspace by means of statistical models, since denser regions usually result in better-fitting models. The experimental results support this argument. We also show how to integrate this algorithm with two grid clustering procedures in order to find these dense regions. Experimental results on both synthetic and real data sets show that row shuffling can drastically increase the density of the subspaces, leading to better clusters.
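
The abstract does not spell out the shuffling procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a 2-D binary occupancy matrix whose rows correspond to the values of one categorical, unordered dimension, and it substitutes a simple greedy heuristic: place each row next to the unused row it differs from least (Hamming distance). The function names `shuffle_rows` and `max_block_density`, the heuristic, and the toy data are all illustrative assumptions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def shuffle_rows(matrix):
    """Return a row ordering built greedily (assumed heuristic).

    Starts from row 0 and repeatedly appends the unused row closest
    to the last placed row, so that similar rows end up adjacent.
    """
    if not matrix:
        return []
    remaining = list(range(1, len(matrix)))
    order = [0]
    while remaining:
        last = matrix[order[-1]]
        nxt = min(remaining, key=lambda i: hamming(last, matrix[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


def max_block_density(matrix, height, width):
    """Highest fraction of occupied cells over aligned height x width grid blocks."""
    best = 0.0
    rows, cols = len(matrix), len(matrix[0])
    for r in range(0, rows - height + 1, height):
        for c in range(0, cols - width + 1, width):
            filled = sum(
                matrix[i][j]
                for i in range(r, r + height)
                for j in range(c, c + width)
            )
            best = max(best, filled / (height * width))
    return best


# Toy data (hypothetical): rows alternate between two occupancy patterns.
data = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

order = shuffle_rows(data)
shuffled = [data[i] for i in order]

density_before = max_block_density(data, 2, 2)
density_after = max_block_density(shuffled, 2, 2)
```

On this toy input, every 2 x 2 grid block is half full before reordering, while grouping identical rows together produces fully occupied blocks. Reordering is legitimate here only because the row dimension is assumed to be categorical and unordered, so no ordering of its values carries meaning.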