There has been much research on checkpointing algorithms for parallel and distributed systems, but surprisingly few implementations exist for uniprocessors, multiprocessors, and distributed systems, and none at all for multicomputers. We discuss ickp, our consistent checkpointer for the Intel iPSC/860, which is the first general-purpose checkpointer for a multicomputer. It is a checkpointing library that may be invoked asynchronously from the host processor, at a periodic interval, or by a library call. It implements three consistent checkpointing algorithms, two optimizations to reduce checkpoint time and overhead, and recovery.