A distributed RAID in a serverless cluster computer is used to save checkpoint files periodically, as dictated by the checkpointing algorithm, for rollback recovery. The striped checkpointing algorithm proposed in this paper combines the merits of the coordinated and the staggered checkpointing algorithms. Coordination enables parallel I/O on distributed disks, while staggering avoids the network bottleneck in distributed disk I/O operations. For a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. The striped checkpointing approach allows dynamic reconfiguration to minimize checkpointing overhead among concurrent software processes. We demonstrate how to reduce the overhead by striping and staggering dynamically. For communication-intensive computational programs, the new scheme can significantly reduce the checkpointing overhead. Linpack HPC Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing algorithms for fast rollback recovery from any single-node failure in a cluster computer.
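The tradeoff between stripe parallelism and staggering for a fixed cluster size can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes, purely for illustration, that the nodes are split into equal stagger groups that checkpoint one after another, and that the nodes within a group write in parallel, each to its own disk. The names `stagger_schedule`, `enumerate_configs`, `num_nodes`, and `stripe_width` are hypothetical.

```python
def stagger_schedule(num_nodes, stripe_width):
    """Partition nodes into stagger groups of `stripe_width` nodes each.

    Assumption (illustrative only): groups checkpoint one after another
    (staggering), while the nodes inside a group write at the same time,
    each to its own disk (striping).
    """
    if stripe_width < 1 or num_nodes % stripe_width != 0:
        raise ValueError("stripe_width must be a positive divisor of num_nodes")
    return [list(range(start, start + stripe_width))
            for start in range(0, num_nodes, stripe_width)]


def enumerate_configs(num_nodes):
    """List every (stripe_width, stagger_depth) pair for a fixed cluster size.

    A wider stripe means more parallel disk I/O but more simultaneous
    network traffic; a deeper stagger means the opposite.
    """
    return [(width, num_nodes // width)
            for width in range(1, num_nodes + 1)
            if num_nodes % width == 0]


if __name__ == "__main__":
    # Show each candidate configuration for a hypothetical 12-node cluster.
    for width, depth in enumerate_configs(12):
        print(f"stripe width {width:2d} x stagger depth {depth:2d}")
```

Under these assumptions, a stripe width of 1 corresponds to fully staggered checkpointing and a stripe width equal to the cluster size corresponds to fully coordinated checkpointing. Dynamic reconfiguration then amounts to choosing one of the intermediate pairs at run time.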