As large-scale parallel systems, many of them employing hundreds of computing engines, take on mission-critical roles, it is crucial to design them to anticipate and accommodate failures. Failures are commonplace in such systems and can no longer be treated as exceptions. Despite their growing importance, our understanding of how failures affect the performance of parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we use this framework to conduct a detailed analysis of the impact of failures on system performance under a wide range of scheduling policies. Our results demonstrate that failures can have a significant impact on mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs appears to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy that enhances system performance, with the former providing the greater benefit.
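To make the two reported metrics concrete, the following minimal sketch computes mean job response time and mean job slowdown for a toy set of jobs. It assumes the conventional definitions from the job scheduling literature (response time as completion time minus arrival time, slowdown as response time divided by failure-free runtime); the abstract does not state the exact definitions used, and the `Job` record, its field names, and the sample numbers are hypothetical illustrations, not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record; field names are illustrative assumptions.
    arrival: float  # time the job entered the queue
    finish: float   # time the job finally completed, including any rerun after a failure
    runtime: float  # failure-free execution time of the job


def response_time(job: Job) -> float:
    """Assumed definition: total time from arrival to completion."""
    return job.finish - job.arrival


def slowdown(job: Job) -> float:
    """Assumed definition: response time normalized by failure-free runtime."""
    return response_time(job) / job.runtime


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


# Made-up example: the second job waits in the queue and is delayed
# further because a failure forces it to redo work.
jobs = [
    Job(arrival=0.0, finish=10.0, runtime=10.0),
    Job(arrival=0.0, finish=40.0, runtime=10.0),
]

mean_response = mean(response_time(j) for j in jobs)
mean_slowdown = mean(slowdown(j) for j in jobs)
print(mean_response, mean_slowdown)
```

Under these assumed definitions, work lost to a failure pushes a job's completion time later without changing its failure-free runtime, so it raises both metrics at once.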