SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
The integrated microbial genomes (IMG) system: a case study in biological data management
VLDB '05 Proceedings of the 31st international conference on Very large data bases
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Evaluating MapReduce for Multi-core and Multiprocessor Systems
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Bioinformatics
Lambda calculus as a workflow model
Concurrency and Computation: Practice & Experience - Special Issue: 3rd International Workshop on Workflow Management and Applications in Grid Environments (WaGe2008)
CloudWF: A Computational Workflow System for Clouds Based on Hadoop
CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing
A formal semantics for the Taverna 2 workflow model
Journal of Computer and System Sciences
Design patterns for efficient graph algorithms in MapReduce
Proceedings of the Eighth Workshop on Mining and Learning with Graphs
A multi-dimensional classification model for scientific workflow characteristics
Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science
Twister: a runtime for iterative MapReduce
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Spark: cluster computing with working sets
HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
A model of computation for MapReduce
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Rapid parallel genome indexing with MapReduce
Proceedings of the second international workshop on MapReduce and its applications
Magellan: experiences from a science cloud
Proceedings of the 2nd international workshop on Scientific cloud computing
Cloud MapReduce: A MapReduce Implementation on Top of a Cloud Operating System
CCGRID '11 Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
MARIANE: MApReduce Implementation Adapted for HPC Environments
GRID '11 Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing
Benchmarking MapReduce Implementations for Application Usage Scenarios
GRID '11 Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing
SIDR: structure-aware intelligent data routing in Hadoop
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, Energy Frontiers Research Center for Gas Separations Relevant to Clean Energy Technologies, climate simulations, and Uncertainty Quantification in fluid and solid dynamics { all run large numbers of parallel analyses, which we call scientific ensembles. These scientific ensembles have a large number of tasks with control and data dependencies. Current tools for creating and managing these ensembles in HPC environments are limited and difficult to use; this is proving to be a limiting factor to running scientific ensembles at the large scale enabled by these HPC environments. MapReduce and its open-source implementation, Hadoop, is an attractive paradigm due to the simplicity of the programming model and intrinsic mechanisms for handling scalability and fault-tolerance. In this paper, we evaluate the programmability of MapReduce and Hadoop for scientific workflow ensembles.