Distributed approximate spectral clustering for large-scale datasets

Authors:
Mohamed Hefeeda; Fei Gao; Wael Abd-Almageed
Affiliations:
Qatar Computing Research Institute, Doha, Qatar; Simon Fraser University, Surrey, BC, Canada; University of Maryland, College Park, MD, USA
Venue:
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Year:
2012

Abstract

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process with many current machine learning algorithms because of their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem: they require a kernel matrix, which takes O(N^2) time and space to compute and store. The proposed algorithm substantially reduces the computation and memory overhead required to compute the kernel matrix without significantly affecting the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results against the required computing resources. The algorithm is designed to be independent of the kernel-based machine learning algorithm that is subsequently applied, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
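
The abstract does not spell out how the kernel matrix is approximated, so the following is only an illustrative sketch under stated assumptions, not the authors' method. It assumes one plausible scheme: points are grouped by a random-hyperplane signature, and Gaussian kernel values are computed only between points that share a group, so that a set of small blocks is stored instead of the full N x N matrix. The function names, the bucketing rule, and the parameters `num_bits`, `sigma`, and `seed` are all hypothetical.

```python
import math
import random
from collections import defaultdict


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def signature(point, hyperplanes):
    """Assumed bucketing rule: one sign bit per random hyperplane."""
    return tuple(
        1 if sum(w * v for w, v in zip(h, point)) >= 0 else 0
        for h in hyperplanes
    )


def bucketed_kernel(points, num_bits=2, sigma=1.0, seed=0):
    """Compute kernel blocks only among points that share a signature.

    Returns a dict mapping signature -> (indices, block), where block is
    a len(indices) x len(indices) list of lists of kernel values.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    hyperplanes = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)
    ]

    # Group point indices by signature (hypothetical grouping step).
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[signature(p, hyperplanes)].append(i)

    # Build one small kernel block per bucket instead of one N x N matrix.
    blocks = {}
    for sig, idx in buckets.items():
        block = [
            [gaussian_kernel(points[i], points[j], sigma) for j in idx]
            for i in idx
        ]
        blocks[sig] = (idx, block)
    return blocks


def stored_entries(blocks):
    """Total number of kernel values kept across all blocks."""
    return sum(len(idx) ** 2 for idx, _ in blocks.values())


if __name__ == "__main__":
    data_rng = random.Random(42)
    points = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
              for _ in range(200)]
    blocks = bucketed_kernel(points, num_bits=3)
    print("full matrix entries:", len(points) ** 2)
    print("stored block entries:", stored_entries(blocks))
```

In this sketch, `num_bits` plays the role of the approximation knob: more bits tend to produce more, smaller groups, which lowers the number of stored kernel values but discards more cross-group similarities. Because each block depends only on the points in its own group, the per-group work could in principle be distributed across workers, which is the kind of structure a MapReduce-style implementation can exploit.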