Lazy release consistency for software distributed shared memory
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Lightweight logging for lazy release consistent distributed shared memory
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Scalable fault-tolerant distributed shared memory
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
The Virtual Interface Architecture
IEEE Micro
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
An Efficient Logging Scheme for Recoverable Distributed Shared Memory Systems
ICDCS '97 Proceedings of the 17th International Conference on Distributed Computing Systems (ICDCS '97)
Coherence-Centric Logging and Recovery for Home-Based Software Distributed Shared Memory
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
TreadMarks: distributed shared memory on standard workstations and operating systems
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
Log-based rollback recovery without checkpoints of shared memory in software DSM
The Journal of Supercomputing
Hi-index | 0.01 |
As software Distributed Shared Memory(DSM) systems become attractive on larger clusters, the focus of attention moves toward improving the reliability of systems. In this paper, we propose a lightweight logging scheme, called remote logging, and a recovery protocol for home-based DSM. Remote logging stores coherence-related data to the volatile memory of a remote node. The logging overhead can be moderated with high-speed system area network and user-level DMA operations supported by modern communication protocols. Remote logging tolerates multiple failures if the backup nodes of failed nodes are alive. It makes the reliability of DSM grow much higher. Experimental results show that our fault-tolerant DSM has low overhead compared to conventional stable logging and it can be effectively recovered from some concurrent failures.