New unsupervised clustering algorithm for large datasets

  • Authors:
  • William Peter;John Chiochetti;Clare Giardina

  • Affiliations:
  • BAE Systems Advanced Technologies, Columbia, MD;BAE Systems Advanced Technologies, Columbia, MD;BAE Systems Advanced Technologies, Columbia, MD

  • Venue:
  • Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2003

Abstract

A fast and accurate unsupervised clustering algorithm has been developed for very large datasets. Though designed for very large volumes of geospatial data, the algorithm is general enough to be used in a wide variety of application domains. The algorithm requires approximately O(N) computations, and is thus faster than hierarchical algorithms. Unlike the popular K-means heuristic, it does not require a series of iterations to converge to a solution. In addition, the method does not depend on initializing a given number of cluster representatives, and so is insensitive to initial conditions. Being unsupervised, the algorithm can also "rank" each cluster based on density. The method relies on weighting a dataset to grid points on a mesh, and using a small number of rule-based agents to find the high-density clusters. This effectively reduces a large dataset to the size of the grid, which is usually many orders of magnitude smaller. Numerical experiments demonstrate the advantages of this algorithm over other techniques.
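The grid-weighting idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the agents' rules or the weighting scheme, so the sketch assumes nearest-grid-point weighting on a 2-D mesh and substitutes a plain flood fill over adjacent dense cells for the rule-based agents. The names `grid_cluster`, `cell_size`, and `min_count` are hypothetical.

```python
from collections import defaultdict, deque


def grid_cluster(points, cell_size, min_count):
    """Bin 2-D points onto a mesh, then group adjacent dense cells."""
    # Step 1: a single pass over the data, O(N). Each point adds weight 1
    # to the cell containing it (nearest-grid-point weighting, an assumption).
    counts = defaultdict(int)
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1

    # Step 2: keep only cells whose count reaches the density threshold.
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # Step 3: flood fill over 8-connected dense cells. This stands in for
    # the rule-based agents, whose rules the abstract does not describe.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, cells = deque([start]), []
        while queue:
            cx, cy = queue.popleft()
            cells.append((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        total = sum(counts[c] for c in cells)
        clusters.append(
            {"cells": cells, "points": total, "density": total / len(cells)}
        )

    # Rank clusters from densest to sparsest (mean points per cell).
    clusters.sort(key=lambda c: c["density"], reverse=True)
    return clusters
```

After the binning pass, all remaining work is done on the occupied grid cells rather than on the N original points, which is the reduction the abstract describes. No initial cluster centers are supplied, although the result in this sketch does depend on the chosen `cell_size` and `min_count`.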