Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Lazy release consistency for software distributed shared memory
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Lightweight recoverable virtual memory
ACM Transactions on Computer Systems (TOCS) - Special issue on operating systems principles
A checkpoint protocol for an entry consistent shared memory system
PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Ensuring correct rollback recovery in distributed shared memory systems
Journal of Parallel and Distributed Computing - Special issue on distributed shared memory systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Shasta: a low overhead, software-only approach for supporting fine-grain shared memory
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Hiding communication latency and coherence overhead in software DSMs
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Lightweight logging for lazy release consistent distributed shared memory
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Cashmere-2L: software coherent shared memory on a clustered remote-write network
Proceedings of the sixteenth ACM symposium on Operating systems principles
On Coordinated Checkpointing in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Memory exclusion: optimizing the performance of checkpointing systems
Software—Practice & Experience
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Scalable fault-tolerant distributed shared memory
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Software Support for Virtual Memory-Mapped Communication
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Lazy Logging and Prefetch-Based Crash Recovery in Software Distributed Shared Memory Systems
IPPS '99/SPDP '99 Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing
The performance of consistent checkpointing in distributed shared memory systems
SRDS '95 Proceedings of the 14TH Symposium on Reliable Distributed Systems
Coherence-Centric Logging and Recovery for Home-Based Software Distributed Shared Memory
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Home-based shared virtual memory
Home-based shared virtual memory
A model for characterizing the scalability of distributed systems
ACM SIGOPS Operating Systems Review
Hi-index | 0.00 |
In this paper, we address the problem of garbage collection in a single-failure fault-tolerant home-based lazy release consistency (HLRC) distributed shared-memory (DSM) system based on independent checkpointing and logging. Our solution uses laziness in garbage collection and exploits consistency constraints of the HLRC memory model for low overhead and scalability. We prove safe bounds on the state that must be retained in the system to guarantee correct recovery after a failure. We devise two algorithms for garbage collection of checkpoints and logs, checkpoint garbage collection (CGC), and lazy log trimming (LLT). The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers. In such systems, using global synchronization or extra communication for garbage collection is inefficient or simply impractical due to system scale and temporary disconnections in communication. The challenge lies in controlling the size of the logs and the number of checkpoints without global synchronization while tolerating transient disruptions in communication. Our garbage collection scheme is completely distributed, does not force processes to synchronize, does not add extra messages to the base DSM protocol, and uses only the available DSM protocol information. Evaluation results for real applications show that it effectively bounds the number of past checkpoints to be retained and the size of the logs in stable storage.