p-PIC: Parallel power iteration clustering for big data

Authors:
Weizhong Yan (Machine Learning Lab); Umang Brahmakshatriya (Machine Learning Lab); Ya Xue (Machine Learning Lab); Mark Gilder (Computing & Cyber Security Lab); Bowden Wise (Knowledge Discovery Lab)
Affiliations:
All labs are part of GE Global Research Center, One Research Circle, Niskayuna, NY 12039, United States
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

Power iteration clustering (PIC) is a recently developed clustering algorithm. It clusters data by embedding the data points in a low-dimensional subspace derived from the similarity matrix. Compared to traditional clustering algorithms, PIC is simple, fast, and relatively scalable. However, it requires that the data and the associated similarity matrix fit into memory, which makes the algorithm infeasible for big data applications. This paper aims to extend PIC's data scalability by implementing a parallel power iteration clustering algorithm (p-PIC). While the paper focuses on exploring different parallelization strategies and implementation details for minimizing computation and communication costs, we have also paid close attention to ensuring that the algorithm works well on low-end commodity computers (COTS-based clusters and general-purpose servers found at most commercial cloud providers). The experimental results demonstrate that the proposed p-PIC algorithm scales well with both data size and compute resources.
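
The idea summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation. The update rule (row-normalize the similarity matrix, then repeat a matrix-vector product with L1 normalization until the change stabilizes) follows the commonly described PIC formulation and is not spelled out in this abstract. The row-block split is only one plausible parallelization strategy, simulated here with a loop instead of real workers; the abstract does not say which strategy the paper adopts. All function names, the RBF similarity, the `n_blocks` parameter, and the tiny 1-D k-means are assumptions made for illustration.

```python
import numpy as np

def rbf_affinity(X, sigma=1.0):
    """Dense RBF similarity matrix with a zero diagonal (illustrative choice)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    return A

def pic_embedding(A, n_blocks=4, max_iter=100, tol=None):
    """One-dimensional PIC embedding via truncated power iteration.

    The rows of the normalized matrix are split into n_blocks groups to
    mimic how separate workers could each own a slice of the matrix.
    """
    n = A.shape[0]
    tol = 1e-5 / n if tol is None else tol
    deg = A.sum(axis=1)
    W = A / deg[:, None]                  # row-normalized similarity matrix
    v = deg / deg.sum()                   # degree-based starting vector
    blocks = np.array_split(np.arange(n), n_blocks)
    prev_delta = np.full(n, np.inf)
    for _ in range(max_iter):
        # Each "worker" multiplies its own row block by the shared vector;
        # concatenating the pieces stands in for a gather step.
        v_new = np.concatenate([W[rows] @ v for rows in blocks])
        v_new /= np.abs(v_new).sum()      # L1 normalization
        delta = np.abs(v_new - v)
        accel = np.max(np.abs(delta - prev_delta))
        v, prev_delta = v_new, delta
        if accel < tol:                   # stop once the change stabilizes
            break
    return v

def kmeans_1d(v, k, n_iter=50):
    """Small Lloyd-style k-means on a 1-D embedding, quantile-initialized."""
    centers = np.quantile(v, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = v[labels == j].mean()
    return labels
```

A typical call would be `labels = kmeans_1d(pic_embedding(rbf_affinity(X)), k)`. In this sketch every block still reads from one in-memory matrix; in a distributed setting each worker would hold only its own rows, which is the memory limitation the abstract identifies in the original PIC.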