Scientific applications have diverse characteristics and resource requirements. Combined with the complexity of the distributed resources on which they execute (e.g., Grid and cloud computing systems), this means that applications can experience significant performance fluctuations as machine reliability varies. Although the performance and the reliability of cluster and Grid systems have each been studied separately, there has been little analysis of the Quality of Service (QoS) lost as availability levels vary. To enable a dynamic environment that can account for such changes while still providing the required QoS, next-generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model, and we use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.
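As a rough illustration of the performability idea mentioned above, the sketch below computes an expected performance level by weighting the performance delivered in each availability state by the probability of being in that state. This is the textbook reward-style formulation, not necessarily the model developed in the paper; the function name, the state list, and all numbers are hypothetical and chosen only for illustration.

```python
def performability(states):
    """Return the expected performance over a set of availability states.

    `states` is an iterable of (probability, performance) pairs, where
    `probability` is the chance the system is in that state and
    `performance` is the rate delivered in it (any consistent unit).
    This is an assumed, simplified formulation for illustration only.
    """
    states = list(states)
    total = sum(p for p, _ in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * perf for p, perf in states)


# Hypothetical resource: fully up, partially degraded, or down.
example_states = [
    (0.90, 100.0),  # all nodes available, full performance
    (0.08, 75.0),   # some nodes failed, degraded performance
    (0.02, 0.0),    # resource unavailable
]

if __name__ == "__main__":
    print(performability(example_states))
```

A planner could compare candidate resources by such a value rather than by peak performance alone, so that a fast but unreliable resource is not automatically preferred over a slower, more available one.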