Performability modeling for scheduling and fault tolerance strategies for scientific workflows

Authors:
Lavanya Ramakrishnan; Daniel A. Reed
Affiliations:
Indiana University, Bloomington, IN, USA; Microsoft Research, Redmond, WA, USA
Venue:
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Year:
2008

Abstract

Scientific applications have diverse characteristics and resource requirements. When combined with the complexity of the underlying distributed resources on which they execute (e.g., Grid and cloud computing), these applications can experience significant performance fluctuations as machine reliability varies. Although the performance and reliability of cluster and Grid systems have been studied separately, there has been little analysis of the lost Quality of Service (QoS) experienced with varying availability levels. To enable a dynamic environment that can account for such changes while providing the required QoS, next-generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model and use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.
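
As a rough illustration of the performability idea mentioned in the abstract, the sketch below computes an expected reward over a small set of resource states, weighting the performance delivered in each state by the probability of being in that state. This is the generic textbook form of the measure, not necessarily the model developed in the paper. The function name `expected_performability`, the three states, and all probabilities and reward values are hypothetical and chosen only for illustration.

```python
def expected_performability(states):
    """Return the probability-weighted performance over resource states.

    `states` is a list of (probability, reward) pairs, where `reward` is the
    normalized performance delivered in that state (1.0 = full performance,
    0.0 = no useful work). Probabilities must sum to 1.
    """
    total_probability = sum(p for p, _ in states)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * r for p, r in states)


# Hypothetical resource with three availability states (assumed values):
# fully available, degraded (half performance), and failed.
states = [
    (0.90, 1.0),  # fully available
    (0.08, 0.5),  # degraded
    (0.02, 0.0),  # failed
]

perf = expected_performability(states)
print(f"expected performability: {perf:.2f}")
```

Under these assumed values, a resource that looks 98% "up" delivers only about 94% of its nominal performance on average, which is the kind of gap a combined performance-and-reliability measure is meant to expose.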