Lazy release consistency for software distributed shared memory
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Lightweight logging for lazy release consistent distributed shared memory
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
A Survey of Recoverable Distributed Shared Virtual Memory Systems
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Thread migration and its applications in distributed shared memory systems
Journal of Systems and Software
Fast cluster failover using virtual memory-mapped communication
ICS '99 Proceedings of the 13th international conference on Supercomputing
Accelerating shared virtual memory via general-purpose network interface support
ACM Transactions on Computer Systems (TOCS)
Scalable fault-tolerant distributed shared memory
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Experiences with VI communication for database storage
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Shared Virtual Memory Clusters with Next-Generation Interconnection Networks and Wide Compute Nodes
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
CableS: Thread Control and Memory System Extensions for Shared Virtual Memory Clusters
WOMPAT '01 Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming
Data Replication Strategies for Fault Tolerance and Availability on Commodity Clusters
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
InterWeave: A Middleware System for Distributed Shared State
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
User-Level Communication in Cluster-Based Servers
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
TreadMarks: distributed shared memory on standard workstations and operating systems
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
NT-SwiFT: software implemented fault tolerance on windows NT
WINSYM'98 Proceedings of the 2nd conference on USENIX Windows NT Symposium - Volume 2
Application-level checkpointing for shared memory programs
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Fault-Tolerant Distributed Shared Memory on a Broadcast-Based Architecture
IEEE Transactions on Parallel and Distributed Systems
Fast and transparent recovery for continuous availability of cluster-based servers
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Log-based rollback recovery without checkpoints of shared memory in software DSM
The Journal of Supercomputing
Lightweight logging and recovery for distributed shared memory over virtual interface architecture
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Hi-index | 0.00 |
A challenging issue in today's server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address this problem in shared virtual memory (SVM) clusters at the programming abstraction layer. We design extensions to an existing SVM protocol that has been tuned for low-latency, high-bandwidth interconnects and SMP nodes and we achieve reliability through dynamic replication of application shared data and protocol information. Our extensions allow us to tolerate single (or multiple, but not simultaneous) node failures. We implement our extensions on a state-of-the-art cluster and we evaluate the common, failure-free case. We find that, although the complexity of our protocol is substantially higher than its failure-free counterpart, by taking advantage of architectural features of modern systems our approach imposes low overhead and can be employed for transparently dealing with system failures.