SPARCL: an effective and efficient algorithm for mining arbitrary shape-based clusters

Authors:
Vineet Chaoji;Mohammad Al Hasan;Saeed Salem;Mohammed J. Zaki
Affiliations:
Rensselaer Polytechnic Institute, Computer Science Department, Troy, NY 12180, USA (all authors)
Venue:
Knowledge and Information Systems
Year:
2009


Abstract

Clustering is one of the fundamental data mining tasks. Many clustering paradigms have been developed over the years, including partitional, hierarchical, mixture-model-based, density-based, spectral, and subspace methods. This paper focuses on full-dimensional, arbitrarily shaped clusters. Existing methods for this problem have high memory or time complexity (quadratic or even cubic), which has restricted them to datasets of moderate size. We propose SPARCL, a simple and scalable algorithm for finding clusters of arbitrary shapes and sizes that has linear space and time complexity. SPARCL consists of two stages: the first runs a carefully initialized version of the Kmeans algorithm to generate many small seed clusters, and the second iteratively merges the generated clusters to obtain the final shape-based clusters. Experiments on a variety of datasets highlight the effectiveness, efficiency, and scalability of our approach. On the large datasets, SPARCL is an order of magnitude faster than the best existing approaches.
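
The two-stage idea described in the abstract (many small seed clusters, then iterative merging) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not specify the initialization or the merge criterion, so the sketch substitutes farthest-point seeding for the careful initialization and merges seed clusters by centroid distance (single-link over seed centers). The function names `kmeans_seeds`, `merge_seeds`, and `two_stage_cluster`, and the parameters `k` and `n_clusters`, are hypothetical. The merge step here is quadratic in the number of seeds and makes no claim about matching the complexity reported for SPARCL.

```python
import numpy as np

def kmeans_seeds(points, k, iters=20):
    """Stage 1 (sketch): Lloyd's k-means producing k small seed clusters.
    Farthest-point seeding is an assumed stand-in for the paper's initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave an empty seed's center where it is
                centers[j] = members.mean(axis=0)

    # Final assignment against the final centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, np.argmin(dists, axis=1)

def merge_seeds(centers, n_clusters):
    """Stage 2 (sketch): repeatedly merge the two closest seed groups until
    n_clusters remain. Centroid distance is an assumed merge criterion."""
    k = len(centers)
    group = np.arange(k)  # group[i] = id of the merged cluster holding seed i
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    rows, cols = np.triu_indices(k, 1)
    remaining = k
    for idx in np.argsort(d[rows, cols]):
        if remaining <= n_clusters:
            break
        a, b = group[rows[idx]], group[cols[idx]]
        if a != b:
            group[group == b] = a
            remaining -= 1
    return group

def two_stage_cluster(points, k, n_clusters):
    """Run both stages and return one final cluster label per point."""
    centers, seed_labels = kmeans_seeds(points, k)
    return merge_seeds(centers, n_clusters)[seed_labels]
```

As a usage example, calling `two_stage_cluster(points, k=8, n_clusters=2)` on two well-separated elongated point sets should assign each set its own label, because seeds within one set lie closer to each other than to any seed in the other set.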