Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory
IEEE Transactions on Parallel and Distributed Systems
Logging and Recovery in Adaptive Software Distributed Shared Memory Systems
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
In this paper, we propose a new, efficient logging protocol, called lazy logging, and a fast crash recovery protocol, called prefetch-based crash recovery (PCR), for software distributed shared memory (SDSM). Our lazy logging protocol minimizes failure-free overhead by logging only the data indispensable for correct recovery, while our PCR protocol reduces recovery time by prefetching data according to future memory access patterns, thus eliminating the memory miss penalty during the recovery process. We have performed experiments on workstation clusters, comparing our protocols against the earlier reduced-stable logging (RSL) protocol by implementing both protocols in TreadMarks, a state-of-the-art SDSM system. The experimental results show that our lazy logging protocol consistently outperforms the RSL protocol: it increases failure-free execution time by only 1% to 4%, whereas the RSL protocol incurs an execution time overhead of 6% to 21% due to its larger log size and higher disk access frequency. Our PCR protocol also outperforms the widely used simple crash recovery protocol by 18% to 57% for all applications examined.
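The idea behind prefetching during recovery can be illustrated with a toy simulation. The sketch below is not the PCR implementation from the paper. It assumes a hypothetical per-interval log of the pages a recovering process touched (`access_log`) and a hypothetical dictionary standing in for remote page copies (`remote`), and it only counts how many demand misses a replay would incur with and without fetching each interval's pages ahead of time.

```python
from typing import Dict, List


def replay(access_log: List[List[int]], remote: Dict[int, str], prefetch: bool) -> int:
    """Replay logged intervals and return the number of demand misses.

    access_log and remote are illustrative stand-ins, not structures
    taken from the paper or from TreadMarks.
    """
    cache: Dict[int, str] = {}
    misses = 0
    for pages in access_log:
        if prefetch:
            # Fetch every page this interval is known to touch before replaying it.
            for p in pages:
                cache.setdefault(p, remote[p])
        for p in pages:
            if p not in cache:
                # Demand fetch: a real replay would stall here waiting for the page.
                misses += 1
                cache[p] = remote[p]
    return misses


remote = {p: f"page-{p}" for p in range(6)}
access_log = [[0, 1], [1, 2, 3], [3, 4, 5]]

simple_misses = replay(access_log, remote, prefetch=False)
pcr_misses = replay(access_log, remote, prefetch=True)
print(simple_misses, pcr_misses)
```

In this toy model the replay without prefetching stalls once for every distinct page, while the prefetching replay never stalls, because the logged access pattern tells it which pages to fetch before each interval begins. How much of that fetch latency a real system can hide depends on details the abstract does not give.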