p-PIC: Parallel power iteration clustering for big data

  • Authors:
  • Weizhong Yan;Umang Brahmakshatriya;Ya Xue;Mark Gilder;Bowden Wise

  • Affiliations:
  • Machine Learning Lab, GE Global Research Center, One Research Circle, Niskayuna, NY 12039, United States;Machine Learning Lab, GE Global Research Center, One Research Circle, Niskayuna, NY 12039, United States;Machine Learning Lab, GE Global Research Center, One Research Circle, Niskayuna, NY 12039, United States;Computing & Cyber Security Lab, GE Global Research Center, One Research Circle, Niskayuna, NY 12039, United States;Knowledge Discovery Lab, GE Global Research Center, One Research Circle, Niskayuna, NY 12039, United States

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2013

Abstract

Power iteration clustering (PIC) is a recently developed clustering algorithm. It performs clustering by embedding data points in a low-dimensional subspace derived from the similarity matrix. Compared to traditional clustering algorithms, PIC is simple, fast and relatively scalable. However, it requires that the data and its associated similarity matrix fit into memory, which makes the algorithm infeasible for big data applications. This paper attempts to expand PIC's data scalability by implementing a parallel power iteration clustering algorithm (p-PIC). While this paper focuses on exploring different parallelization strategies and implementation details for minimizing computation and communication costs, we have also paid close attention to ensuring that the algorithm works well on low-end commodity computers (COTS-based clusters and general-purpose servers found at most commercial cloud providers). The experimental results demonstrate that the proposed p-PIC algorithm scales well with both data size and compute resources.
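To make the embedding step described in the abstract concrete, the following is a minimal serial sketch of power iteration clustering in plain Python. It is an illustration under stated assumptions, not the p-PIC implementation from the paper: the Gaussian similarity, the fixed iteration count, the toy 1-D data, and the helper names (`gaussian_affinity`, `pic_embedding`, `split_at_largest_gap`) are all hypothetical, and a simple largest-gap threshold stands in for the k-means step normally applied to the embedding.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Dense similarity matrix with a zero diagonal (Gaussian kernel is an assumption)."""
    n = len(points)
    return [
        [
            0.0 if i == j
            else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def pic_embedding(affinity, iterations=20):
    """Run a truncated power iteration on the row-normalized affinity matrix.

    Stopping early (here: after a fixed, assumed number of iterations) leaves a
    one-dimensional embedding that is nearly constant within each cluster.
    """
    degrees = [sum(row) for row in affinity]
    total = sum(degrees)
    v = [d / total for d in degrees]  # start from the normalized degree vector
    for _ in range(iterations):
        # Apply W = D^-1 A to v; each row is computed independently of the others.
        u = [
            sum(a * x for a, x in zip(row, v)) / d
            for row, d in zip(affinity, degrees)
        ]
        norm = sum(abs(x) for x in u)  # L1 normalization keeps the vector bounded
        v = [x / norm for x in u]
    return v


def split_at_largest_gap(values):
    """Two-way split of a 1-D embedding at its largest gap (stand-in for k-means)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[k + 1]] - values[order[k]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = 0 if rank <= cut else 1
    return labels


# Toy data: two well-separated groups on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3]
labels = split_at_largest_gap(pic_embedding(gaussian_affinity(points)))
print(labels)
```

The sketch holds the full n-by-n similarity matrix in memory, which is exactly the limitation the abstract identifies. Because each entry of the matrix-vector product depends only on one row of the matrix, splitting the rows across workers is one natural way to distribute the computation; whether and how p-PIC does this is not stated in the abstract.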