Scalable parallel OPTICS data clustering using graph algorithmic techniques

Authors:
Mostofa Ali Patwary;Diana Palsetia;Ankit Agrawal;Wei-keng Liao;Fredrik Manne;Alok Choudhary
Affiliations:
Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;University of Bergen, Norway;Northwestern University, Evanston, IL
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging because the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating-point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.