Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters

Authors:
Rosalia Christodoulopoulou;Reza Azimi;Angelos Bilas
Venue:
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Year:
2003

Abstract

A challenging issue in today's server systems is dealing transparently with failures while meeting application-imposed requirements for continuous operation. In this paper we address this problem in shared virtual memory (SVM) clusters at the programming abstraction layer. We design extensions to an existing SVM protocol that has been tuned for low-latency, high-bandwidth interconnects and SMP nodes, and we achieve reliability through dynamic replication of application shared data and protocol information. Our extensions tolerate single node failures, as well as multiple node failures provided they are not simultaneous. We implement our extensions on a state-of-the-art cluster and evaluate the common, failure-free case. We find that, although the complexity of our protocol is substantially higher than that of its failure-free counterpart, our approach takes advantage of architectural features of modern systems to impose low overhead, and it can be employed to deal transparently with system failures.
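The replication idea in the abstract can be illustrated with a small toy model. The sketch below is not the paper's protocol; it only assumes that every shared page is kept on two distinct live nodes and that, after a node fails, the surviving copy is re-replicated onto another live node, which is one plausible reading of why multiple failures are tolerable only when they are not simultaneous. The class `ReplicatedStore` and all of its method names are hypothetical.

```python
class ReplicatedStore:
    """Toy model (hypothetical): each page is held by two distinct live nodes."""

    def __init__(self, node_ids):
        if len(node_ids) < 2:
            raise ValueError("need at least two nodes to replicate")
        self.live = list(node_ids)
        self.memory = {n: {} for n in node_ids}  # node -> {page: value}
        self.homes = {}                          # page -> [primary, secondary]

    def _pick_other(self, exclude):
        # First live node that does not already hold the page, or None.
        for n in self.live:
            if n not in exclude:
                return n
        return None

    def write(self, page, value):
        if page not in self.homes:
            # Assumed placement policy: round-robin primary, next node as replica.
            primary = self.live[len(self.homes) % len(self.live)]
            secondary = self._pick_other({primary})
            self.homes[page] = [primary] if secondary is None else [primary, secondary]
        # Every write is applied to all current copies of the page.
        for node in self.homes[page]:
            self.memory[node][page] = value

    def read(self, page):
        primary = self.homes[page][0]
        return self.memory[primary][page]

    def fail(self, node):
        if len(self.live) <= 1:
            raise RuntimeError("cannot lose the last live node")
        self.live.remove(node)
        del self.memory[node]
        for page, holders in self.homes.items():
            if node not in holders:
                continue
            # Promote the surviving copy, then restore two-way replication
            # if another live node is available.
            survivor = [h for h in holders if h != node][0]
            new_replica = self._pick_other({survivor})
            if new_replica is None:
                holders[:] = [survivor]
            else:
                holders[:] = [survivor, new_replica]
                self.memory[new_replica][page] = self.memory[survivor][page]


store = ReplicatedStore(["n0", "n1", "n2"])
store.write("p0", 1)
store.write("p1", 2)
store.fail("n0")   # first failure: data survives on the replica
store.fail("n1")   # second, non-simultaneous failure: still recoverable
print(store.read("p0"), store.read("p1"))
```

In this model, two nodes failing at the same instant could destroy both copies of a page before re-replication runs, whereas failures separated by a completed re-replication step lose no data.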