VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

Authors:
Lavanya Ramakrishnan;Charles Koelbel;Yang-Suk Kee;Rich Wolski;Daniel Nurmi;Dennis Gannon;Graziano Obertelli;Asim YarKhan;Anirban Mandal;T. Mark Huang;Kiran Thyagaraja;Dmitrii Zagorodnov
Affiliations:
Indiana University, Bloomington;Rice University;Oracle US Inc.;University of California, Santa Barbara;University of California, Santa Barbara;Microsoft Research;University of California, Santa Barbara;University of Tennessee, Knoxville;Renaissance Computing Institute;University of Houston;Rice University;University of California, Santa Barbara
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009


Abstract

Today's scientific workflows use distributed heterogeneous resources through diverse grid and cloud interfaces that are often hard to program. In addition, time-sensitive critical applications in particular need predictable quality of service across these distributed resources. VGrADS' virtual grid execution system (vgES) provides a uniform qualitative resource abstraction over grid and cloud systems. We apply vgES to schedule a set of deadline-sensitive weather forecasting workflows. Specifically, this paper reports on our experiences with (1) virtualized reservations for batch queue systems, (2) coordinated usage of TeraGrid (batch queue), Amazon EC2 (cloud), our own clusters (batch queue) and Eucalyptus (cloud) resources, and (3) fault tolerance through automated task replication. Together, these techniques enable a new workflow planning method that balances performance, reliability and cost considerations. The results point toward improved resource selection and execution management support for a variety of e-Science applications over grids and cloud systems.
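
To make the reliability and cost trade-off behind task replication concrete, the sketch below computes how many replicas of a task are needed to reach a target success probability. It is an illustration only, not the planning method used in the paper: it assumes replicas fail independently and share one success probability, and the names `success_probability` and `replicas_needed` are hypothetical.

```python
def success_probability(p: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies succeeds.

    Assumption (not from the paper): copies fail independently and each
    succeeds with the same probability p.
    """
    return 1.0 - (1.0 - p) ** replicas


def replicas_needed(p: float, target: float) -> int:
    """Smallest replica count whose combined success probability >= target."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    n = 1
    while success_probability(p, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical numbers: each replica succeeds 80% of the time and the
    # workflow step should succeed with probability of at least 99%.
    print(replicas_needed(0.8, 0.99))
```

Under these assumptions each added replica raises reliability by a smaller amount while resource cost grows linearly, which is the kind of tension a planner balancing performance, reliability and cost has to resolve.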