Accelerating MapReduce Analytics Using CometCloud

  • Authors:
  • Moustafa AbdelBaky; Hyunjoo Kim; Ivan Rodero; Manish Parashar

  • Venue:
  • CLOUD '12 Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing
  • Year:
  • 2012

Abstract

MapReduce-Hadoop has emerged as an effective framework for large-scale data analytics, providing support for executing jobs and storing data in a parallel and distributed manner. MapReduce has been shown to perform very well on large datacenters running applications where the data can be effectively divided into homogeneous chunks running across homogeneous hardware. However, the performance of MapReduce-Hadoop is far from ideal when the hardware, the datasets, or both are heterogeneous. Such heterogeneity is unavoidable in many academic computing environments that use multiple generations of hardware and share resources among users. Heterogeneity is also unavoidable in scientific applications that process a varying number of datasets of different sizes. In these cases, the performance of MapReduce-Hadoop can be a concern. In this paper, we implement MapReduce on top of CometCloud to address the issue of heterogeneity and support application classes that involve irregular datasets (e.g., a large number of small data files or datasets of varying sizes). Furthermore, we develop an autonomic manager that can schedule MapReduce tasks based on user objectives, provision resources accordingly, and support on-demand scale-up and cloudbursts. These resources can be selected from a hybrid infrastructure such as local clusters, datacenters, and public clouds. The performance of the developed solution is verified using a protein data mining application operating on data from the Protein Data Bank. The application is deployed, based on deadline and budget constraints, on a cluster at Rutgers and/or Amazon EC2 resources. The experimental results show that the MapReduce-CometCloud framework can effectively support applications operating on large numbers of small data files in a heterogeneous and distributed environment, and satisfy user objectives autonomically using cloudbursts.
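
To make the deadline- and budget-driven cloudburst idea concrete, the following is a minimal sketch of how a manager might decide how many public-cloud workers to add. It is not the paper's autonomic manager or any CometCloud API: the function name `plan_cloudburst`, its parameters, the uniform per-task runtime, and the hourly rounded-up billing model are all assumptions made for illustration.

```python
import math


def plan_cloudburst(num_tasks, task_seconds, local_workers,
                    deadline_seconds, cloud_price_per_hour, budget):
    """Return how many cloud workers to add, or None if the goal is infeasible.

    Hypothetical model: every task takes `task_seconds` on any worker, each
    worker runs its tasks one after another, and cloud workers are billed per
    started hour for the whole deadline window.
    """
    if task_seconds > deadline_seconds:
        return None  # not even one task fits before the deadline

    # How many tasks a single worker can finish before the deadline.
    tasks_per_worker = int(deadline_seconds // task_seconds)

    # Total workers required, then the shortfall the local cluster cannot cover.
    workers_needed = math.ceil(num_tasks / tasks_per_worker)
    cloud_workers = max(0, workers_needed - local_workers)

    # Assumed billing: hours are rounded up, one price for every cloud worker.
    billed_hours = math.ceil(deadline_seconds / 3600)
    cost = cloud_workers * billed_hours * cloud_price_per_hour
    if cost > budget:
        return None  # deadline cannot be met within the budget

    return cloud_workers


if __name__ == "__main__":
    # 1000 one-minute tasks, 8 local workers, one-hour deadline.
    print(plan_cloudburst(1000, 60, 8, 3600, 0.10, 5.0))
```

A real scheduler for heterogeneous hardware and irregular datasets would replace the single `task_seconds` value with per-task and per-resource estimates; the uniform value is used here only to keep the arithmetic visible.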