Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory

Authors:
Florin Sultan; Thu D. Nguyen; Liviu Iftode
Affiliations:
Rutgers Univ., Piscataway, NJ (all authors)
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2002

Abstract

Editor's Note: This paper unfortunately contains some errors, which led to its being reprinted in the October 2002 issue. Please see IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 10, October 2002, pp. 1085-1098 for the correct paper (available without subscription).

In this paper, we address the problem of garbage collection in a single-failure fault-tolerant home-based lazy release consistency (HLRC) distributed shared-memory (DSM) system based on independent checkpointing and logging. Our solution uses laziness in garbage collection and exploits consistency constraints of the HLRC memory model for low overhead and scalability. We prove safe bounds on the state that must be retained in the system to guarantee correct recovery after a failure. We devise two algorithms for garbage collection of checkpoints and logs: checkpoint garbage collection (CGC) and lazy log trimming (LLT). The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers. In such systems, using global synchronization or extra communication for garbage collection is inefficient or simply impractical due to system scale and temporary disconnections in communication. The challenge lies in controlling the size of the logs and the number of checkpoints without global synchronization while tolerating transient disruptions in communication. Our garbage collection scheme is completely distributed, does not force processes to synchronize, does not add extra messages to the base DSM protocol, and uses only the available DSM protocol information. Evaluation results for real applications show that it effectively bounds the number of past checkpoints to be retained and the size of the logs in stable storage.
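The abstract does not spell out how CGC or LLT work, so the following is only a minimal sketch of the general idea of trimming a log lazily from information carried on ordinary protocol messages. It is not the paper's LLT algorithm. The names `Process`, `LogEntry`, `known_ckpt`, and `on_protocol_message`, and the use of a single integer interval as the piggybacked checkpoint marker, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    interval: int   # hypothetical logical interval in which the entry was logged
    dest: int       # peer that might need this entry to replay after a failure
    payload: str


@dataclass
class Process:
    pid: int
    log: list = field(default_factory=list)
    # known_ckpt[q]: highest interval that peer q is known to have covered
    # with a checkpoint (assumed to be piggybacked on regular messages).
    known_ckpt: dict = field(default_factory=dict)

    def append(self, interval, dest, payload):
        """Record state that `dest` might need for recovery."""
        self.log.append(LogEntry(interval, dest, payload))

    def on_protocol_message(self, sender, piggybacked_ckpt):
        """Handle a regular DSM message; no GC-only message is ever sent."""
        prev = self.known_ckpt.get(sender, -1)
        # Messages may arrive late or out of order, so never move backwards.
        self.known_ckpt[sender] = max(prev, piggybacked_ckpt)
        self.trim()

    def trim(self):
        """Keep only entries not yet covered by the destination's checkpoint."""
        self.log = [
            e for e in self.log
            if e.interval > self.known_ckpt.get(e.dest, -1)
        ]
```

In this sketch a process trims only when it happens to hear from a peer, so a temporarily disconnected peer simply delays trimming of the entries destined for it rather than blocking anyone. The safe retention bounds themselves are what the paper proves and are not reproduced here.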