Scalable parallel OPTICS data clustering using graph algorithmic techniques

  • Authors:
  • Mostofa Ali Patwary; Diana Palsetia; Ankit Agrawal; Wei-keng Liao; Fredrik Manne; Alok Choudhary

  • Affiliations:
  • Northwestern University, Evanston, IL; Northwestern University, Evanston, IL; Northwestern University, Evanston, IL; Northwestern University, Evanston, IL; University of Bergen, Norway; Northwestern University, Evanston, IL

  • Venue:
  • SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
  • Year:
  • 2013

Abstract

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to that given by the classical OPTICS algorithm.
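
The role of the disjoint-set structure in cluster extraction can be illustrated with a small sequential sketch. It is not the POPTICS implementation: it assumes the input is a list of spanning-tree edges `(u, v, weight)` whose weights stand in for reachability distances, and that two points belong to the same cluster when a tree edge with weight at most a threshold `eps` connects them. The names `DisjointSet`, `extract_clusters`, and `eps` are hypothetical and chosen for illustration only.

```python
class DisjointSet:
    """Union-find with path halving and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point every visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def extract_clusters(n_points, tree_edges, eps):
    """Group points joined by tree edges of weight <= eps (assumed rule)."""
    ds = DisjointSet(n_points)
    for u, v, w in tree_edges:
        if w <= eps:
            ds.union(u, v)
    groups = {}
    for p in range(n_points):
        groups.setdefault(ds.find(p), []).append(p)
    return sorted(groups.values())


# Hypothetical toy input: six points, point 5 has no tree edge.
edges = [(0, 1, 0.5), (1, 2, 0.7), (2, 3, 5.0), (3, 4, 0.4)]
clusters = extract_clusters(6, edges, eps=1.0)
```

Because each union depends only on the two endpoints of an edge, the edges can be processed in any order, which is the property that makes the structure attractive for parallel extraction. In this sketch a singleton group such as `[5]` could be treated as noise, though that treatment is an assumption here as well.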