A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Authors:
Mostofa Ali Patwary;Diana Palsetia;Ankit Agrawal;Wei-keng Liao;Fredrik Manne;Alok Choudhary
Affiliations:
Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;Northwestern University, Evanston, IL;University of Bergen, Norway;Northwestern University, Evanston, IL
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012


Abstract

DBSCAN is a well-known density-based clustering algorithm capable of discovering arbitrarily shaped clusters and eliminating noise data. However, parallelizing DBSCAN is challenging because it exhibits an inherently sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy, which can easily cause an unbalanced workload and hence low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) based on graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters, which yields a better-balanced workload distribution. We implement the algorithm for both shared and distributed memory. Using data sets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups of up to 25.97 using 40 cores on a shared memory architecture and up to 5,765 using 8,192 cores on a distributed memory architecture.
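
The idea of using a disjoint-set structure to remove DBSCAN's dependence on visiting order can be illustrated with a minimal sequential sketch. This is not the authors' implementation: the function names (`find`, `union`, `dsu_dbscan`), the brute-force neighbor search, and the label convention (root index per cluster, -1 for noise) are assumptions made for illustration, and the parallel merging step described in the abstract is omitted. Each point is handled independently: if it is a core point, its tree is merged with the trees of its neighbors, so clusters grow bottom-up as trees are unioned rather than by expanding one cluster at a time.

```python
from math import dist  # Euclidean distance, Python 3.8+


def find(parent, i):
    """Return the root of i's tree, shortening the path as we go (path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the trees containing a and b; the smaller root index becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def dsu_dbscan(points, eps, min_pts):
    """Hypothetical sequential sketch of DBSCAN built on union-find.

    Returns one label per point: the root index of its cluster, or -1 for noise.
    """
    n = len(points)
    parent = list(range(n))    # every point starts as its own tree
    is_core = [False] * n
    assigned = [False] * n     # True once a point belongs to some cluster

    for i in range(n):
        # Brute-force eps-neighborhood (includes i itself); a spatial index
        # would replace this in practice.
        neighbors = [j for j in range(n) if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            continue           # not a core point; may still be claimed as a border point
        is_core[i] = True
        assigned[i] = True
        for j in neighbors:
            if is_core[j]:
                union(parent, i, j)      # core-core: clusters are connected
            elif not assigned[j]:
                assigned[j] = True       # first core point to reach j claims it
                union(parent, i, j)

    return [find(parent, i) if assigned[i] else -1 for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dsu_dbscan(pts, eps=1.5, min_pts=3))
```

Because each point's work consists only of a neighborhood query and a few union operations, the loop body does not rely on a cluster-by-cluster expansion order, which is the property the abstract refers to as breaking the access sequentiality.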