Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems

Authors:
A. J. Oliner; R. K. Sahoo; J. E. Moreira; M. Gupta
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA; IBM T.J. Watson Research Center, Yorktown Heights, NY; IBM T.J. Watson Research Center, Yorktown Heights, NY; IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Year:
2005

Abstract

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a common technique for mitigating the amount of work lost due to job failures, but its effectiveness under realistic circumstances has not been studied. In this paper, we analyze the system-level performance of periodic application checkpointing using parameters similar to those projected for BlueGene/L systems. Our results reflect simulations on a toroidal interconnect architecture, using a real job log from a machine similar to BlueGene/L, and with a real failure distribution from a large-scale cluster. Our simulation studies investigate the impact of parameters such as checkpoint overhead and checkpoint interval on a number of performance metrics, including bounded slowdown, system utilization, and total work lost. The results suggest that periodic checkpointing may not be an effective way to improve the average bounded slowdown or average system utilization metrics, though it reduces the amount of work lost due to failures. We show that overzealous checkpointing with high overhead can amplify the effects of failures. The study also suggests that new metrics and checkpointing techniques may be required to effectively handle job failures on large-scale machines like BlueGene/L.
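The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulator: the function names, the fixed per-checkpoint cost, and the 10-second bounded-slowdown threshold are assumptions chosen for illustration, and the bounded-slowdown formula is the form commonly used in the job scheduling literature, which may differ in detail from the paper's definition.

```python
import math


def runtime_with_checkpoints(work, interval, overhead):
    """Failure-free wall-clock time for `work` seconds of computation when a
    checkpoint costing `overhead` seconds is taken after every `interval`
    seconds of useful work (assumed: no checkpoint after the final segment)."""
    n_checkpoints = max(math.ceil(work / interval) - 1, 0)
    return work + n_checkpoints * overhead


def work_lost_on_failure(progress, interval):
    """Useful work discarded when a failure strikes after `progress` seconds
    of useful work: everything done since the last completed checkpoint."""
    return progress % interval


def bounded_slowdown(wait, run, tau=10.0):
    """Commonly used bounded slowdown: response time over run time, with the
    run time floored at `tau` seconds (assumed threshold) and a minimum of 1."""
    return max((wait + run) / max(run, tau), 1.0)


if __name__ == "__main__":
    work = 3600  # one hour of useful computation (hypothetical job)
    for interval, overhead in [(600, 60), (60, 60)]:
        total = runtime_with_checkpoints(work, interval, overhead)
        print(f"interval={interval}s overhead={overhead}s -> runtime={total}s")
    print("work lost at 1500s with 600s interval:",
          work_lost_on_failure(1500, 600))
    print("bounded slowdown (wait=100s, run=5s):", bounded_slowdown(100, 5))
```

With these hypothetical numbers, checkpointing every 60 seconds at a 60-second cost nearly doubles the failure-free runtime. A longer-running job holds its nodes longer and is exposed to failures for longer, which is one way high-overhead checkpointing can amplify the effect of failures even as it reduces the work lost per failure.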