Adaptive incremental checkpointing for massively parallel systems

Authors:
Saurabh Agarwal; Rahul Garg; Meeta S. Gupta; Jose E. Moreira
Affiliations:
IBM India Research Labs, New Delhi, India; IBM India Research Labs, New Delhi, India; IBM India Research Labs, New Delhi, India; IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the 18th annual international conference on Supercomputing
Year:
2004

Abstract

At the scale of massively parallel systems, faults are no longer exceptional but regular events, so periodic checkpointing is becoming increasingly important. However, the huge memory footprints of parallel applications severely limit the scalability of conventional checkpointing techniques. Incremental checkpointing is a well-researched technique that addresses these scalability concerns, but most implementations require paging support from the hardware and the underlying operating system, which may not always be available. In this paper, we propose a software-based adaptive incremental checkpointing technique that uses a secure hash function to uniquely identify changed blocks in memory. Our algorithm is the first self-optimizing algorithm that dynamically computes the optimal block boundaries based on the history of changed blocks, which provides better opportunities for minimizing checkpoint file size. Since the hash is computed in software, no system support is needed. We have implemented and tested this mechanism on the BlueGene/L system. Our results on several well-known benchmarks are encouraging, both in the reduction in average checkpoint file size and in adaptivity to the application's memory access patterns.
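The general idea of hash-based incremental checkpointing with adaptive block boundaries can be illustrated with a minimal Python sketch. This is not the paper's algorithm, whose details the abstract does not give. The split-on-change and merge-when-unchanged heuristic, the choice of SHA-256, the size bounds `MIN_BLOCK` and `MAX_BLOCK`, and the names `IncrementalCheckpointer` and `restore` are all assumptions made for illustration.

```python
import hashlib

MIN_BLOCK = 64    # hypothetical lower bound on block size (bytes)
MAX_BLOCK = 4096  # hypothetical upper bound on block size (bytes)


def digest(memory, start, end):
    """Secure hash of one block of the memory image."""
    return hashlib.sha256(memory[start:end]).digest()


class IncrementalCheckpointer:
    def __init__(self, size, block=1024):
        # Initial fixed-size blocks as (start, end) pairs covering [0, size).
        self.blocks = [(s, min(s + block, size)) for s in range(0, size, block)]
        self.hashes = {}  # (start, end) -> digest at the previous checkpoint

    def checkpoint(self, memory):
        """Return changed (offset, bytes) pieces, then adapt block boundaries."""
        changed, next_blocks, stable = [], [], set()
        for start, end in self.blocks:
            if self.hashes.get((start, end)) != digest(memory, start, end):
                # Block changed (or was never saved): include it in the checkpoint.
                changed.append((start, bytes(memory[start:end])))
                if end - start >= 2 * MIN_BLOCK:
                    # Assumed heuristic: split a changed block in half so that
                    # later small writes dirty fewer bytes.
                    mid = (start + end) // 2
                    next_blocks += [(start, mid), (mid, end)]
                else:
                    next_blocks.append((start, end))
                continue
            # Assumed heuristic: merge a run of unchanged neighbours into one
            # larger block, which means fewer hashes to compute next time.
            prev = next_blocks[-1] if next_blocks else None
            if prev in stable and end - prev[0] <= MAX_BLOCK:
                next_blocks[-1] = (prev[0], end)
            else:
                next_blocks.append((start, end))
            stable.add(next_blocks[-1])
        self.blocks = next_blocks
        # Re-hash under the new boundaries so the next comparison is valid.
        self.hashes = {b: digest(memory, *b) for b in next_blocks}
        return changed


def restore(size, checkpoints):
    """Rebuild the image by replaying checkpoints from oldest to newest."""
    image = bytearray(size)
    for pieces in checkpoints:
        for offset, data in pieces:
            image[offset:offset + len(data)] = data
    return image
```

In this sketch the first call to `checkpoint` saves every block, because no hashes exist yet. Later calls save only the blocks whose hash differs from the stored one. Passing the list of checkpoints to `restore` in order rebuilds the latest memory image. No page-protection or operating-system support is involved, since changes are detected purely by hashing in software.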