Clustering on the cloud: reducing CLARA to MapReduce

  • Authors: Pelle Jakovits; Satish Narayana Srirama
  • Affiliations: University of Tartu, Tartu, Estonia (both authors)
  • Venue: Proceedings of the Second Nordic Symposium on Cloud Computing & Internet Technologies
  • Year: 2013

Abstract

Cloud computing, with its promise of virtually limitless resources, seems well suited to resource-intensive problems in machine learning and data mining, because it allows distributed data mining and machine learning applications to be scaled with little difficulty. However, to run on cloud infrastructure, these applications must first be reduced to frameworks that can successfully exploit cloud resources, such as Hadoop MapReduce, which offers both automatic parallelization and fault tolerance on commodity cloud hardware. Adapting complex algorithms to the MapReduce model is not trivial, as the model is often better suited to simple, embarrassingly parallel algorithms. Yet some more complex algorithms are suitable for MapReduce, and in this work we look at one such algorithm, Clustering LARge Applications (CLARA), which can be used for clustering very large numbers of objects. The paper describes how CLARA is reduced to the MapReduce model and gives a detailed analysis of its Hadoop MapReduce implementation. It also provides a case study in which the algorithm is successfully applied to clustering the pen-based recognition of handwritten digits data set.
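For readers unfamiliar with CLARA, the following is a minimal single-machine sketch of the general idea: draw several random samples, run a k-medoids search (a simplified PAM) on each sample, then score each resulting medoid set against the full data set and keep the cheapest. It is not the authors' Hadoop implementation. The function names (`dist`, `cost`, `pam`, `clara`), the Euclidean distance, and the default parameter values are assumptions made for illustration, and the comments marking "map-like" and "reduce-like" stages only indicate where independent, parallelizable work could plausibly occur.

```python
import random

def dist(a, b):
    # Euclidean distance between two equal-length tuples (an assumed metric)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cost(points, medoids):
    # Sum of each point's distance to its nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(sample, k, rng):
    # Simplified PAM: accept any single medoid swap that strictly lowers
    # the cost on the sample, and stop when no swap helps
    medoids = rng.sample(sample, k)
    best = cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(sample, trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    # Map-like stage: each sample independently yields a candidate medoid set
    candidates = [pam(rng.sample(points, size), k, rng) for _ in range(n_samples)]
    # Reduce-like stage: score every candidate on the full data set, keep the cheapest
    return min(candidates, key=lambda m: cost(points, m))
```

Calling `clara(points, k=2)` on a list of numeric tuples returns a list of two medoids drawn from `points`. Because each sample is clustered independently and each candidate is scored independently, both stages are natural candidates for parallel execution, which is the property that makes the algorithm a plausible fit for a MapReduce-style framework.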