Riding the elephant: managing ensembles with hadoop

  • Authors:
  • Elif Dede;Madhusudhan Govindaraju;Daniel Gunter;Lavanya Ramakrishnan

  • Affiliations:
  • State University of New York (SUNY), Binghamton, NY, USA;State University of New York (SUNY), Binghamton, NY, USA;Lawrence Berkeley National Laboratory, Berkeley, CA, USA;Lawrence Berkeley National Laboratory, Berkeley, CA, USA

  • Venue:
  • Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
  • Year:
  • 2011

Abstract

Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, Energy Frontiers Research Center for Gas Separations Relevant to Clean Energy Technologies, climate simulations, and Uncertainty Quantification in fluid and solid dynamics -- all run large numbers of parallel analyses, which we call scientific ensembles. These scientific ensembles have a large number of tasks with control and data dependencies. Current tools for creating and managing these ensembles in HPC environments are limited and difficult to use; this is proving to be a limiting factor in running scientific ensembles at the large scale enabled by these HPC environments. MapReduce, with its open-source implementation Hadoop, is an attractive paradigm due to the simplicity of its programming model and its intrinsic mechanisms for handling scalability and fault tolerance. In this paper, we evaluate the programmability of MapReduce and Hadoop for scientific workflow ensembles.