Scalable fault-tolerant distributed shared memory

Authors:
Florin Sultan;Liviu Iftode;Thu Nguyen
Affiliations:
Department of Computer Science, Rutgers University, Piscataway, NJ;Department of Computer Science, Rutgers University, Piscataway, NJ;Department of Computer Science, Rutgers University, Piscataway, NJ
Venue:
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Year:
2000

Abstract

This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if we can control the size of the log and checkpoints in the absence of global coordination. In this paper we describe the design of our fault-tolerant DSM system and present our solutions to the problems of checkpoint and log management. We also present experimental results showing that our fault tolerance support is lightweight, adding only low messaging, logging, and checkpointing overheads, and that our management algorithms can be expected to effectively bound the size of the checkpoints and logs for real applications.