Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

Authors:
J. S. Plank
Venue:
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Year:
1996

Abstract

Coordinated checkpointing systems are popular, general-purpose tools for implementing process migration, coarse-grained job swapping, and fault tolerance on networks of workstations. Although such systems are simple in concept, several design decisions about where checkpoint files are placed can affect the performance and functionality of coordinated checkpointers. Several such checkpointers have been implemented for popular programming platforms like PVM and MPI, but none has taken this issue into consideration. This paper addresses checkpoint placement and its impact on the performance and functionality of coordinated checkpointing systems. Several strategies, both old and new, are described and implemented on a network of SPARC-5 workstations running PVM. These strategies range from very simple ones to more complex ones that borrow heavily from ideas in RAID (Redundant Arrays of Inexpensive Disks) fault tolerance. The results are intended to serve as a guide so that future implementations of coordinated checkpointing can give their users the combination of performance and functionality that is right for their applications.
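
The abstract does not say which RAID ideas the placement strategies adopt. As an illustration only, the sketch below assumes a single-parity scheme in the style of RAID level 5: each workstation keeps its own checkpoint, one extra host stores the bytewise XOR of all of them, and the checkpoint of any one failed workstation can be rebuilt from the survivors plus the parity. The function names `xor_parity`, `pad`, and `reconstruct` are hypothetical and are not taken from the paper or from PVM.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("blocks must have equal length")
    # zip(*blocks) groups the i-th byte of every block into one column.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def pad(checkpoints):
    """Zero-pad checkpoints to a common length (an assumption of this sketch)."""
    size = max(len(c) for c in checkpoints)
    return [c.ljust(size, b"\0") for c in checkpoints]


def reconstruct(survivors, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_parity(list(survivors) + [parity])


# Hypothetical checkpoint contents for three workstations.
checkpoints = pad([b"state-of-host-0", b"host-1", b"host-2-state"])
parity = xor_parity(checkpoints)

# Simulate losing the checkpoint held by host 1.
survivors = [checkpoints[0], checkpoints[2]]
recovered = reconstruct(survivors, parity)
```

Under these assumptions the scheme tolerates the loss of exactly one checkpoint at a time. Tolerating more simultaneous failures would need a stronger code, which this sketch does not attempt.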