Fault tolerance is very important in cluster computing. Many well-known cluster computing systems implement fault tolerance with a checkpoint/restart mechanism. However, existing checkpointing algorithms cannot restore the state of a file system when a program's execution is rolled back, so existing fault-tolerant systems place many restrictions on file access. This paper presents the SCR algorithm, which is based on atomic operations and consistent scheduling and can restore the state of a file system. In the SCR algorithm, system calls on file systems are classified into idempotent and non-idempotent operations: a non-idempotent operation modifies the file system's state, and an idempotent operation does not. The SCR algorithm dynamically tracks a program's execution and logs to disk each non-idempotent operation the program issues, together with the information needed to undo it. When the program is rolled back to a checkpoint, the SCR algorithm reverts the file system to its state at the time of the last checkpoint. With the SCR algorithm, users can use any file operation in their programs.
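The logging-and-revert idea described above can be illustrated with a minimal user-level sketch. This is not the paper's SCR implementation, which works at the level of file system calls and also relies on atomic operations and consistent scheduling that are not modeled here. The class `UndoLog`, its method names, and the in-memory log (the abstract describes logging to disk) are assumptions made only for illustration.

```python
import os
import tempfile


class UndoLog:
    """Hypothetical undo log for file operations (illustration only)."""

    def __init__(self):
        # Undo records for state-modifying operations since the last checkpoint.
        self.entries = []

    def checkpoint(self):
        # The current file state is the new recovery point: drop old records.
        self.entries.clear()

    def read(self, path):
        # Does not modify file system state, so nothing is logged.
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        # Modifies state: save the previous contents (None if the file is new).
        old = self.read(path) if os.path.exists(path) else None
        self.entries.append(("restore", path, old))
        with open(path, "wb") as f:
            f.write(data)

    def remove(self, path):
        # Modifies state: save the contents so the file can be recreated.
        self.entries.append(("restore", path, self.read(path)))
        os.remove(path)

    def rename(self, src, dst):
        # Modifies state: the undo step is the reverse rename.
        # Simplifying assumption: refuse to overwrite an existing target.
        if os.path.exists(dst):
            raise FileExistsError(dst)
        self.entries.append(("rename", dst, src))
        os.rename(src, dst)

    def rollback(self):
        # Undo logged operations in reverse order, back to the last checkpoint.
        while self.entries:
            kind, first, second = self.entries.pop()
            if kind == "rename":
                os.rename(first, second)
            elif second is None:
                if os.path.exists(first):
                    os.remove(first)
            else:
                with open(first, "wb") as f:
                    f.write(second)


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = UndoLog()
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        log.write(a, b"v1")
        log.checkpoint()               # recovery point: a.txt contains "v1"
        log.write(a, b"v2")            # overwrite after the checkpoint
        log.write(b, b"new")           # create a new file
        log.rename(b, os.path.join(d, "c.txt"))
        log.rollback()                 # revert to the checkpointed state
        return log.read(a), sorted(os.listdir(d))
```

Calling `demo()` writes a file, takes a checkpoint, performs three state-modifying operations, and rolls back; afterwards only `a.txt` remains, with its checkpointed contents. Reads are never logged, which mirrors the abstract's distinction between operations that modify file system state and those that do not.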