Enabling large-scale scientific workflows on petascale resources using MPI master/worker

Authors:
Mats Rynge;Scott Callaghan;Ewa Deelman;Gideon Juve;Gaurang Mehta;Karan Vahi;Philip J. Maechling
Affiliations:
University of Southern California;University of Southern California;University of Southern California;University of Southern California;University of Southern California;University of Southern California;University of Southern California
Venue:
Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Year:
2012

Abstract

Computational scientists often need to execute large, loosely coupled parallel applications, such as workflows and bags of tasks, to do their research. These applications are typically composed of many short-running serial tasks, which frequently demand large amounts of computation and storage. To produce results in a reasonable amount of time, scientists would like to execute these applications on petascale resources. In the past this has been a challenge because petascale systems are not designed to execute such workloads efficiently. In this paper we describe a new approach to executing large, fine-grained workflows on distributed petascale systems. Our solution involves partitioning the workflow into independent subgraphs and then submitting each subgraph as a self-contained MPI job to the available resources, which are often remote. We describe how the partitioning and job management have been implemented in the Pegasus Workflow Management System. We also explain how this approach provides an end-to-end solution for challenges related to system architecture, queue policies and priorities, and application reuse and development. Finally, we describe how the system is being used to enable the execution of a very large seismic hazard analysis application on XSEDE resources.
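
The partition-then-dispatch idea in the abstract can be illustrated with a minimal Python sketch. It is not the Pegasus implementation: it assumes that "independent subgraphs" means weakly connected components of the task graph, and it only simulates the order in which a master would hand ready tasks to workers instead of using MPI. The function names `partition_independent` and `master_dispatch_order` and the toy workflow are hypothetical.

```python
from collections import defaultdict, deque


def partition_independent(tasks, edges):
    """Group tasks so that no dependency edge crosses two groups.

    Assumption for this sketch: an "independent subgraph" is a weakly
    connected component, found here with a small union-find.
    """
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(list)
    for t in tasks:
        groups[find(t)].append(t)
    return sorted(sorted(g) for g in groups.values())


def master_dispatch_order(tasks, edges):
    """Order in which a master could release tasks to workers.

    A task becomes ready once all of its parents have finished
    (Kahn's topological sort). Edges touching tasks outside this
    subgraph are ignored.
    """
    children = defaultdict(list)
    pending_parents = {t: 0 for t in tasks}
    for u, v in edges:
        if u in pending_parents and v in pending_parents:
            children[u].append(v)
            pending_parents[v] += 1

    ready = deque(sorted(t for t, n in pending_parents.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # a real MPI master would send this to an idle worker
        for child in children[task]:
            pending_parents[child] -= 1
            if pending_parents[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical toy workflow: two unrelated pipelines.
tasks = ["a", "b", "c", "x", "y"]
edges = [("a", "b"), ("a", "c"), ("x", "y")]

subgraphs = partition_independent(tasks, edges)
orders = [master_dispatch_order(sg, edges) for sg in subgraphs]
```

In this sketch each entry of `subgraphs` stands in for one self-contained job, and the matching entry of `orders` is the sequence a master process inside that job could follow when releasing tasks to idle workers.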