Application level fault tolerance in heterogeneous networks of workstations
Journal of Parallel and Distributed Computing
Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme
IEEE Transactions on Computers
IEEE Transactions on Parallel and Distributed Systems
Quasi-asynchronous migration: a novel migration protocol for PVM tasks
ACM SIGOPS Operating Systems Review
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Handbook of Applied Cryptography
Handbook of Applied Cryptography
ickp: A Consistent Checkpointer for Multicomputers
IEEE Parallel & Distributed Technology: Systems & Technology
Low-Latency, Concurrent Checkpointing for Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Managing Checkpoints for Parallel Programs
IPPS '96 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
Design, Implementation, and Performance of Checkpointing in NetSolve
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Journal of Systems Architecture: the EUROMICRO Journal
Automated application-level checkpointing of MPI programs
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
How Safe is Probabilistic Checkpointing?
FTCS '98 Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
A Version of MASM Portable Across Different UNIX Systems and Different Hardware Architectures
DS-RT '05 Proceedings of the 9th IEEE International Symposium on Distributed Simulation and Real-Time Applications
The overhead model of word-level and page-level incremental checkpointing
Proceedings of the 2006 ACM symposium on Applied computing
Stabilizers: a modular checkpointing abstraction for concurrent functional programs
Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
Cooperative checkpointing: a robust approach to large-scale systems reliability
Proceedings of the 20th annual international conference on Supercomputing
Modular Checkpointing for Atomicity
Electronic Notes in Theoretical Computer Science (ENTCS)
Compiler-Enhanced Incremental Checkpointing
Languages and Compilers for Parallel Computing
Software-assisted hardware reliability: abstracting circuit-level challenges to the software stack
Proceedings of the 46th Annual Design Automation Conference
Eliminating voltage emergencies via software-guided code transformations
ACM Transactions on Architecture and Code Optimization (TACO)
Lightweight checkpointing for concurrent ml
Journal of Functional Programming
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Cooperative checkpointing theory
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
libhashckpt: hash-based incremental checkpointing using GPU's
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Impact on the writing granularity for incremental checkpointing
FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part II
Proceedings of the 6th international workshop on Virtualization Technologies in Distributed Computing Date
McrEngine: a scalable checkpointing system using data-aware aggregation and compression
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Evaluating energy savings for checkpoint/restart
E2SC '13 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing
Accelerating incremental checkpointing for extreme-scale computing
Future Generation Computer Systems
McrEngine: A scalable checkpointing system using data-aware aggregation and compression
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.00 |
Given the scale of massively parallel systems, occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in these systems. However, huge memory footprints of parallel applications place severe limitations on scalability of normal checkpointing techniques. Incremental checkpointing is a well researched technique that addresses scalability concerns, but most of the implementations require paging support from hardware and the underlying operating system, which may not be always available. In this paper, we propose a software based adaptive incremental checkpoint technique which uses a secure hash function to uniquely identify changed blocks in memory. Our algorithm is the first self-optimizing algorithm that dynamically computes the optimal block boundaries, based on the history of changed blocks. This provides better opportunities for minimizing checkpoint file size. Since the hash is computed in software, we do not need any system support for this. We have implemented and tested this mechanism on the BlueGene/L system. Our results on several well-known benchmarks are encouraging, both in terms of reduction in average checkpoint file size and adaptivity towards application's memory access patterns.