High-performance and distributed computing systems such as petascale, grid, and cloud infrastructures are increasingly used to run scientific models and business services. These systems experience large variations in availability because of hardware and software failures. Resource providers need to account for these variations while delivering the required Quality of Service (QoS) at appropriate cost in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the QoS lost at varying availability levels. In this paper, we present a resource performability model that estimates lost performance and the corresponding cost considerations at varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources, trading off performance, reliability, and cost. Simulation results driven by failure data collected over the lifetime of high-performance systems demonstrate how the proposed scheme better accounts for resource availability.
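As a rough illustration of the performability idea, and not the model developed in the paper, one common formulation weights the performance delivered in each availability state by the probability of the resource being in that state. The sketch below assumes three hypothetical states (fully up, degraded, down); the probabilities, performance levels, and function names are illustrative assumptions only.

```python
from typing import Sequence


def expected_performance(probs: Sequence[float], perf: Sequence[float]) -> float:
    """Probability-weighted performance over availability states.

    probs[i] is the (assumed) steady-state probability of state i,
    perf[i] is the (assumed) performance delivered in that state.
    """
    if len(probs) != len(perf):
        raise ValueError("probs and perf must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * r for p, r in zip(probs, perf))


def lost_performance(probs: Sequence[float], perf: Sequence[float]) -> float:
    """Gap between the best-case performance and the expected performance."""
    return max(perf) - expected_performance(probs, perf)


# Hypothetical resource: fully up, degraded, down (values are made up).
state_probs = [0.90, 0.08, 0.02]
state_perf = [100.0, 50.0, 0.0]  # e.g. tasks per hour in each state

expected = expected_performance(state_probs, state_perf)
lost = lost_performance(state_probs, state_perf)
print(f"expected performance: {expected:.2f}")
print(f"lost performance:     {lost:.2f}")
```

A scheduler could compare such estimates across candidate resources when weighing performance against reliability and cost, though how the paper's planner combines them is not specified in this abstract.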