ickp: A Consistent Checkpointer for Multicomputers

Authors:
James S. Plank;Kai Li
Venue:
IEEE Parallel & Distributed Technology: Systems & Applications
Year:
1994

Abstract

There has been much research on checkpointing algorithms for parallel and distributed systems, but there are surprisingly few implementations for uniprocessors, multiprocessors, and distributed systems, and none at all for multicomputers. We discuss ickp, our consistent checkpointer for the Intel iPSC/860, which is the first general-purpose checkpointer for a multicomputer. It is a checkpointing library that may be invoked asynchronously from the host processor, at a periodic interval, or by a library call. It implements three consistent checkpointing algorithms, two optimizations that reduce checkpoint time and overhead, and recovery.