An evaluation of parallel job scheduling for ASCI Blue-Pacific
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Performance analysis of checkpointing strategies
ACM Transactions on Computer Systems (TOCS)
Scheduling with unexpected machine breakdowns
Discrete Applied Mathematics
Processor allocation and checkpoint interval selection in cluster computing systems
Journal of Parallel and Distributed Computing - Special issue on cluster and network-based computing
Improved Utilization and Responsiveness with Gang Scheduling
IPPS '97 Proceedings of the Job Scheduling Strategies for Parallel Processing
Job Scheduling for the BlueGene/L System
JSSPP '02 Revised Papers from the 8th International Workshop on Job Scheduling Strategies for Parallel Processing
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A comparative analysis of event tupling schemes
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Predicting Rare Events In Temporal Domains
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
ICPP '02 Proceedings of the 2002 International Conference on Parallel Processing
VAX/VMS Event Monitoring and Analysis
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Adaptive incremental checkpointing for massively parallel systems
Proceedings of the 18th annual international conference on Supercomputing
Failure Data Analysis of a Large-Scale Heterogeneous Server Environment
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Filtering Failure Logs for a BlueGene/L Prototype
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Performance implications of failures in large-scale cluster scheduling
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing
Performance under failures of high-end computing
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Using checkpointing to recover from poor multi-site parallel job scheduling decisions
Proceedings of the 5th international workshop on Middleware for grid computing: held at the ACM/IFIP/USENIX 8th International Middleware Conference
Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
Petascale system management experiences
LISA'08 Proceedings of the 22nd conference on Large installation system administration conference
Failure-aware workflow scheduling in cluster environments
Cluster Computing
Evaluating cooperative checkpointing for supercomputing systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Cooperative checkpointing theory
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for mitigating the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L, and with a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as checkpoint overhead and checkpoint interval on a number of performance metrics, including bounded slowdown, system utilization, and total work lost. The results suggest that periodic checkpointing may not be an effective way to improve the average bounded slowdown or average system utilization metrics, though it reduces the amount of work lost due to failures. We show that overzealous checkpointing with high overhead can amplify the effects of failures. The study also suggests that new metrics and checkpointing techniques may be required to effectively handle job failures on large-scale machines like BlueGene/L.