Predictable quality of service atop degradable distributed systems

Authors:
Lavanya Ramakrishnan;Daniel A. Reed
Affiliations:
Indiana University, Bloomington, USA;Microsoft Research, Redmond, USA
Venue:
Cluster Computing
Year:
2013

Citing 18
Cited 1

Performance and reliability analysis of computer systems: an example-based approach using the SHARPE software package
A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems

Journal of Parallel and Distributed Computing
Enhancing the Fault Tolerance of Workflow Management Systems

IEEE Concurrency
Toward a Framework for Preparing and Executing Adaptive Grid Programs

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Fault Tolerant Computing on the Grid: What are My Options?

HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Performance Prediction in Production Environments

IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Assessing Fault Sensitivity in MPI Applications

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Parallel scheduling of complex dags under uncertainty

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Service-Oriented Environments for Dynamically Interacting with Mesoscale Weather

Computing in Science and Engineering
Scalable Grid Application Scheduling via Decoupled Resource Selection and Scheduling

CCGRID '06 Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Task scheduling strategies for workflow-based applications in grids

CCGRID '05 Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05) - Volume 2
On Evaluating the Performability of Degradable Computing Systems

IEEE Transactions on Computers
Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms

Scientific Programming - Scientific Workflows
Performability modeling for scheduling and fault tolerance strategies for scientific workflows

HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Measuring the Performance and Reliability of Production Computational Grids

GRID '06 Proceedings of the 7th IEEE/ACM International Conference on Grid Computing
Reliability challenges in large systems

Future Generation Computer Systems
Performance variability of highly parallel architectures

ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III

A grid workflow Quality-of-Service estimation based on resource availability prediction

The Journal of Supercomputing

Abstract

High-performance and distributed computing systems, such as petascale, grid, and cloud infrastructure, are increasingly used to run scientific models and business services. These systems experience large variations in availability because of hardware and software failures. Resource providers need to account for these variations while providing the required Quality of Service (QoS) at appropriate cost in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the QoS lost at varying availability levels. In this paper, we present a resource performability model to estimate lost performance and the corresponding cost considerations at varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources to trade off performance, reliability, and cost. We use simulation results driven by failure data collected over the lifetime of high-performance systems to demonstrate how the proposed scheme better accounts for resource availability.
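
As a rough illustration of the idea of lost performance at varying availability levels, the sketch below computes a generic probability-weighted (expected-reward) performance over a set of degraded states. It is not the model from the paper, whose details the abstract does not give. The class and function names, the state probabilities, and the price are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityState:
    """One operating level of a resource (all values here are hypothetical)."""
    probability: float  # long-run fraction of time spent in this state
    performance: float  # delivered performance relative to full capacity, 0..1


def expected_performance(states):
    """Return the probability-weighted performance across all states."""
    total = sum(s.probability for s in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(s.probability * s.performance for s in states)


def lost_performance(states):
    """Return the fraction of full-capacity performance lost to degradation."""
    return 1.0 - expected_performance(states)


def cost_per_delivered_unit(states, price_per_hour):
    """Return the price paid per unit of performance actually delivered."""
    delivered = expected_performance(states)
    if delivered <= 0.0:
        raise ValueError("resource delivers no performance")
    return price_per_hour / delivered


# Assumed example: fully up 90% of the time, half capacity 8%, down 2%.
EXAMPLE_STATES = [
    AvailabilityState(probability=0.90, performance=1.0),
    AvailabilityState(probability=0.08, performance=0.5),
    AvailabilityState(probability=0.02, performance=0.0),
]

if __name__ == "__main__":
    print(f"expected performance: {expected_performance(EXAMPLE_STATES):.3f}")
    print(f"lost performance:     {lost_performance(EXAMPLE_STATES):.3f}")
    print(f"cost per unit:        {cost_per_delivered_unit(EXAMPLE_STATES, 2.0):.3f}")
```

A scheduler could compare candidate resources on such a cost-per-delivered-unit figure rather than on nominal price alone, which is one simple way to trade off performance, reliability, and cost.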