libhashckpt: hash-based incremental checkpointing using GPU's

Authors:
Kurt B. Ferreira;Rolf Riesen;Ron Brightwell;Patrick Bridges;Dorian Arnold
Affiliations:
Scalable System Software, Sandia National Laboratories;IBM Research, Ireland;Scalable System Software, Sandia National Laboratories;Department of Computer Science, University of New Mexico;Department of Computer Science, University of New Mexico
Venue:
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Year:
2011

Abstract

Concern is growing in the high-performance computing (HPC) community about the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault-tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications use custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization of traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
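
The idea of using hashes to decide which pages belong in an incremental checkpoint can be illustrated with a small sketch. This is not the libhashckpt implementation: it runs on the CPU rather than a GPU, it assumes a 4096-byte page size and SHA-256 as the hash, and the names `incremental_checkpoint` and `candidates` are hypothetical. The `candidates` set stands in for the pages that a page-protection mechanism would have flagged as written; hashing then filters out pages that were written but whose contents did not actually change.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative)


def incremental_checkpoint(memory, prev_hashes, candidates=None,
                           page_size=PAGE_SIZE):
    """Return (changed_pages, new_hashes) for one checkpoint interval.

    memory      -- bytes-like view of the application data (fixed size)
    prev_hashes -- per-page digests from the previous checkpoint, or None
                   for the first (full) checkpoint
    candidates  -- optional set of page indices reported as written, a
                   stand-in for page protection; if None, every page is hashed
    """
    n_pages = (len(memory) + page_size - 1) // page_size
    if prev_hashes is None:
        prev_hashes = [None] * n_pages

    to_check = range(n_pages) if candidates is None else sorted(candidates)
    new_hashes = list(prev_hashes)
    changed = {}

    for i in to_check:
        page = memory[i * page_size:(i + 1) * page_size]
        digest = hashlib.sha256(page).digest()
        # Save the page only if its contents differ from the last checkpoint.
        if digest != prev_hashes[i]:
            changed[i] = bytes(page)
            new_hashes[i] = digest

    return changed, new_hashes


if __name__ == "__main__":
    mem = bytearray(4 * PAGE_SIZE)
    full, hashes = incremental_checkpoint(mem, None)  # first checkpoint: all pages

    mem[PAGE_SIZE + 10] = 7   # real change in page 1
    mem[3 * PAGE_SIZE] = 0    # page 3 is written, but with the same value

    delta, hashes = incremental_checkpoint(mem, hashes, candidates={1, 3})
    print(sorted(delta))      # only pages whose contents really changed
```

In this sketch, page protection alone would have saved both written pages, whereas the hash comparison keeps only the page whose contents differ. The cost of that saving is the hashing itself, which is the work the abstract describes offloading to GPUs.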