This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if the size of the log and checkpoints can be controlled without global coordination. We describe the design of our fault-tolerant DSM system and present our solutions to the problems of checkpoint and log management. We also present experimental results showing that our fault-tolerance support is lightweight, adding only low messaging, logging, and checkpointing overheads, and that our management algorithms can be expected to effectively bound the size of the checkpoints and logs for real applications.