Modeling Coordinated Checkpointing for Large-Scale Supercomputers

Authors:
Karthik Pattabiraman;Christopher Vick;Alan Wood
Affiliations:
University of Illinois at Urbana-Champaign;Sun Microsystems;Sun Microsystems
Venue:
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Year:
2005

Abstract

Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today's systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic Activity Networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.
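
To make the idea of failures during checkpointing and recovery concrete, the following is a minimal Monte Carlo sketch and not the paper's SAN model. It assumes independent, exponentially distributed node failures (so correlated failures and coordination overhead are not modeled), and every name and parameter (`simulate_useful_fraction`, `node_mtbf_h`, `interval_h`, `ckpt_h`, `recovery_h`, `work_h`) is hypothetical. A failure during a compute segment or its checkpoint discards that segment, and a failure during recovery restarts the recovery.

```python
import random


def simulate_useful_fraction(nodes, node_mtbf_h, interval_h, ckpt_h,
                             recovery_h, work_h, seed=0):
    """Return the fraction of wall-clock time spent on useful work.

    Assumption (not from the paper): node failures are independent and
    exponential, so the system-wide failure rate is nodes / node_mtbf_h.
    """
    rng = random.Random(seed)
    rate = nodes / node_mtbf_h            # system failures per hour
    clock = 0.0                           # elapsed wall-clock hours
    done = 0.0                            # work saved by the last good checkpoint
    next_failure = rng.expovariate(rate)  # absolute time of the next failure

    while done < work_h:
        segment = min(interval_h, work_h - done)
        end = clock + segment + ckpt_h    # compute, then write the checkpoint
        if next_failure >= end:
            # Segment and checkpoint both completed without a failure.
            clock = end
            done += segment
            continue

        # Failure during compute or checkpointing: the segment is lost.
        clock = next_failure
        next_failure = clock + rng.expovariate(rate)

        # Recovery itself can be interrupted; each interruption restarts it.
        while next_failure < clock + recovery_h:
            clock = next_failure
            next_failure = clock + rng.expovariate(rate)
        clock += recovery_h

    return work_h / clock


if __name__ == "__main__":
    # Hypothetical parameters chosen only to illustrate the scaling trend.
    for n in (1_000, 10_000, 100_000):
        frac = simulate_useful_fraction(n, 1e5, 1.0, 0.1, 0.1, 100.0)
        print(n, round(frac, 3))
```

Under these assumptions the useful-work fraction falls as the node count grows, because the system-wide failure rate scales with the number of nodes while the checkpoint and recovery costs stay fixed.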