VGrADS: enabling e-Science workflows on grids and clouds with fault tolerance

  • Authors:
  • Lavanya Ramakrishnan;Charles Koelbel;Yang-Suk Kee;Rich Wolski;Daniel Nurmi;Dennis Gannon;Graziano Obertelli;Asim YarKhan;Anirban Mandal;T. Mark Huang;Kiran Thyagaraja;Dmitrii Zagorodnov

  • Affiliations:
  • Indiana University, Bloomington;Rice University;Oracle US Inc.;University of California, Santa Barbara;University of California, Santa Barbara;Microsoft Research;University of California, Santa Barbara;University of Tennessee, Knoxville;Renaissance Computing Institute;University of Houston;Rice University;University of California, Santa Barbara

  • Venue:
  • Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
  • Year:
  • 2009

Abstract

Today's scientific workflows use distributed heterogeneous resources through diverse grid and cloud interfaces that are often hard to program. In addition, especially for time-sensitive critical applications, predictable quality of service is necessary across these distributed resources. VGrADS' virtual grid execution system (vgES) provides a uniform qualitative resource abstraction over grid and cloud systems. We apply vgES to schedule a set of deadline-sensitive weather forecasting workflows. Specifically, this paper reports on our experiences with (1) virtualized reservations for batch queue systems, (2) coordinated usage of TeraGrid (batch queue), Amazon EC2 (cloud), our own clusters (batch queue) and Eucalyptus (cloud) resources, and (3) fault tolerance through automated task replication. The combined effect of these techniques was to enable a new workflow planning method to balance performance, reliability and cost considerations. The results point toward improved resource selection and execution management support for a variety of e-Science applications over grid and cloud systems.
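
The trade-off between reliability and cost that task replication introduces can be illustrated with a small calculation. The sketch below is not the planning method from the paper. It assumes, purely for illustration, that each replica of a task succeeds independently with the same probability, so that n replicas succeed with probability 1 - (1 - p)^n. The function names, the independence assumption, and the per-replica cost are all hypothetical.

```python
def replicas_needed(p_success: float, target: float, max_replicas: int = 64) -> int:
    """Smallest number of replicas whose combined success probability
    reaches `target`, assuming independent replicas that each succeed
    with probability `p_success` (an illustrative assumption)."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    for n in range(1, max_replicas + 1):
        # Probability that at least one of n replicas finishes.
        if 1.0 - (1.0 - p_success) ** n >= target:
            return n
    raise ValueError("target not reachable within max_replicas")


def replication_cost(p_success: float, target: float, cost_per_replica: float) -> float:
    """Total cost of running enough replicas to reach `target`
    (hypothetical flat cost per replica)."""
    return replicas_needed(p_success, target) * cost_per_replica


if __name__ == "__main__":
    # A resource that completes a task 80% of the time, with a 99% goal.
    print(replicas_needed(0.8, 0.99))
    print(replication_cost(0.8, 0.99, cost_per_replica=2.0))
```

Under these assumptions, raising the reliability target increases the replica count, and therefore the cost, in steps. That is the kind of balance between performance, reliability and cost the abstract describes.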