Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Information Processing Letters
IGOR: a system for program debugging via reversible execution
PADD '88 Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging
Demonic memory for process histories
PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Estimating Capacity for Sharing in a Privately Owned Workstation Environment
IEEE Transactions on Software Engineering
Efficient checkpointing on MIMD architectures
Efficient checkpointing on MIMD architectures
RAID: high-performance, reliable secondary storage
ACM Computing Surveys (CSUR)
EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Compiler-assisted full checkpointing
Software—Practice & Experience
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
A case for two-level distributed recovery schemes
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
ickp: A Consistent Checkpointer for Multicomputers
IEEE Parallel & Distributed Technology: Systems & Technology
Low-Latency, Concurrent Checkpointing for Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Evaluation of checkpoint mechanisms for massively parallel machines
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery
Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery
A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-likeSystems
A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-likeSystems
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
IEEE Transactions on Parallel and Distributed Systems
SCR algorithm: saving/restoring states of file systems
ACM SIGOPS Operating Systems Review
Easing the management of data-parallel systems via adaptation
EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
Distributed Storage Layout Schemes
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
Improving Goodput by Coscheduling CPU and Network Capacity
International Journal of High Performance Computing Applications
A novel fault-tolerant parallel algorithm
APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
Robust distributed orthogonalization based on randomized aggregation
Proceedings of the second workshop on Scalable algorithms for large-scale systems
Comparing checkpoint and rollback recovery schemes in a cluster system
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Hi-index | 0.00 |
Coordinated checkpointing systems are popular and general-purpose tools for implementing process migration, coarse-grained job swapping, and fault-tolerance on networks of workstations. Though simple in concept, there are several design decisions concerning the placement of checkpoint files that can impact the performance and functionality of coordinated checkpointers. Although several such checkpointers have been implemented for popular programming platforms like PVM and MPI, none have taken this issue into consideration. This paper addresses the issue of checkpoint placement and its impact on the performance and functionality of coordinated checkpointing systems. Several strategies, both old and new, are described and implemented on a network of SPARC-5 workstations running PVM. These strategies range from very simple to more complex borrowing heavily from ideas in RAID (Redundant Arrays of Inexpensive Disks) fault-tolerance. The results of this paper will serve as a guide so that future implementations of coordinated checkpointing can allow their users to achieve the combination of performance and functionality that is right for their applications.