Riding the elephant: managing ensembles with Hadoop

Authors:
Elif Dede;Madhusudhan Govindaraju;Daniel Gunter;Lavanya Ramakrishnan
Affiliations:
State University of New York (SUNY), Binghamton, NY, USA;State University of New York (SUNY), Binghamton, NY, USA;Lawrence Berkeley National Laboratory, Berkeley, CA, USA;Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Venue:
Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
Year:
2011

Abstract

Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, Energy Frontiers Research Center for Gas Separations Relevant to Clean Energy Technologies, climate simulations, and Uncertainty Quantification in fluid and solid dynamics -- all run large numbers of parallel analyses, which we call scientific ensembles. These scientific ensembles have a large number of tasks with control and data dependencies. Current tools for creating and managing these ensembles in HPC environments are limited and difficult to use; this is proving to be a limiting factor in running scientific ensembles at the large scale enabled by these HPC environments. MapReduce, together with its open-source implementation Hadoop, is an attractive paradigm due to the simplicity of its programming model and its intrinsic mechanisms for handling scalability and fault tolerance. In this paper, we evaluate the programmability of MapReduce and Hadoop for scientific workflow ensembles.