Clustering on the cloud: reducing CLARA to MapReduce

Authors:
Pelle Jakovits; Satish Narayana Srirama
Affiliations:
University of Tartu, Tartu, Estonia; University of Tartu, Tartu, Estonia
Venue:
Proceedings of the Second Nordic Symposium on Cloud Computing & Internet Technologies
Year:
2013

Abstract

Cloud computing, with its promise of virtually limitless resources, appears well suited to resource-intensive problems in machine learning and data mining, because it allows any distributed data mining or machine learning application to be scaled with little difficulty. To run on cloud infrastructure, however, these applications must first be reduced to frameworks that can exploit cloud resources effectively, such as Hadoop MapReduce, which offers both automatic parallelization and fault tolerance on commodity cloud hardware. Adapting complex algorithms to the MapReduce model is not trivial, because the model is better suited to simple, embarrassingly parallel algorithms. Yet some more complex algorithms are suitable for MapReduce, and in this work we look at one of them, Clustering LARge Applications (CLARA), which can be used to cluster very large numbers of objects. The paper describes how CLARA is reduced to the MapReduce model and gives a detailed analysis of the Hadoop MapReduce implementation. It also provides a case study in which the algorithm is successfully applied to clustering the pen-based recognition of handwritten digits data set.
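The abstract does not spell out the decomposition, so the following is only a minimal sketch of how CLARA, as it is commonly described (draw several random samples, run a k-medoid search on each, score every candidate medoid set against the full data set, keep the cheapest), could map onto two map/reduce-style stages. Plain Python functions stand in for Hadoop tasks; no Hadoop API is used. All names (`map_sample`, `map_cost`, `reduce_costs`, `clara`), the parameter defaults, and the simplified greedy swap search used in place of full PAM are assumptions for illustration, not the authors' implementation.

```python
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(sample, k):
    """Simplified PAM-style search: accept any swap that lowers the cost."""
    medoids = list(sample[:k])
    best = total_cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(sample, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def map_sample(points, k, sample_size, seed):
    """Stage 1 map task (hypothetical): cluster one random sample."""
    rng = random.Random(seed)
    return pam(rng.sample(points, sample_size), k)


def map_cost(chunk, candidates):
    """Stage 2 map task (hypothetical): partial cost of each candidate on one chunk."""
    return [(cid, total_cost(chunk, meds)) for cid, meds in enumerate(candidates)]


def reduce_costs(pairs):
    """Stage 2 reduce task (hypothetical): sum partial costs per candidate id."""
    totals = {}
    for cid, cost in pairs:
        totals[cid] = totals.get(cid, 0.0) + cost
    return totals


def clara(points, k, n_samples=5, sample_size=20, n_chunks=4, seed=0):
    """Driver: run both stages and return the cheapest medoid set and its cost."""
    candidates = [map_sample(points, k, sample_size, seed + i)
                  for i in range(n_samples)]
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    pairs = [pair for chunk in chunks for pair in map_cost(chunk, candidates)]
    totals = reduce_costs(pairs)
    best = min(totals, key=totals.get)
    return candidates[best], totals[best]
```

In this sketch the expensive k-medoid search only ever sees a small sample, while the pass over the full data set is a simple sum that splits cleanly across chunks, which is the property that makes this kind of algorithm a plausible fit for MapReduce. Calling `clara(points, k=2)` on a list of coordinate tuples returns the selected medoids together with their total cost over all points.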