Fast and transparent recovery for continuous availability of cluster-based servers

  • Authors:
  • Rosalia Christodoulopoulou; Kaloian Manassiev; Angelos Bilas; Cristiana Amza

  • Affiliations:
  • University of Toronto, Canada; University of Toronto, Canada; University of Crete, Greece; University of Toronto, Canada

  • Venue:
  • Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
  • Year:
  • 2006

Abstract

Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides keeping system state consistent after a failure, one of the main challenges in achieving continuous operation is providing fast reconfiguration. The complexity of the failure-reconfiguration mechanisms employed and their overheads depend on the type of platform used as a server and on the types of applications that need to be supported. In this paper we focus on supporting shared-memory applications running on clusters of commodity nodes and interconnects. Achieving continuous operation for shared-memory applications on clusters presents two main challenges: (a) the fault-tolerance mechanisms employed should be transparent to applications and should have low overhead during failure-free execution; (b) when failures occur, reconfiguration should proceed with minimum application disruption and without requiring full recovery of the failed node. In this work we examine the latter, i.e., (b), the failure-reconfiguration path, in detail. We use a previously developed system [8] that achieves (a) by dynamically replicating data to the memories of multiple nodes during execution. We examine in detail how the runtime system can achieve minimum application interruption when failures occur. We present the design and implementation of FineFRC (Fine-grained Failure Reconfiguration on Clusters), a runtime system for achieving continuous operation of shared-memory applications on commodity clusters without requiring application instrumentation or human intervention. We present results using a working, 16-processor system that achieves sub-second failure-reconfiguration times.
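
The mechanism the abstract outlines, keeping copies of shared data in the memories of several nodes during failure-free execution and, on a failure, switching to a surviving copy instead of recovering the failed node, can be illustrated with a small sketch. The C++ toy model below is not the FineFRC implementation; the Page and Cluster types, the fail_node operation, and the single-process simulation are assumptions made purely for illustration.

    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // A shared-memory "page": its contents plus the set of nodes holding a copy.
    // In this toy model the contents are stored once; a real runtime would push
    // each update into every replica node's memory.
    struct Page {
        std::string data;
        int primary;                 // node currently serving the page
        std::set<int> replicas;      // nodes that hold a copy (includes primary)
    };

    class Cluster {
    public:
        explicit Cluster(int nodes) : alive_(nodes, true) {}

        // Failure-free path: an ordinary update to the page contents.
        void write(std::map<int, Page>& pages, int page_id, const std::string& v) {
            pages[page_id].data = v;
        }

        // Reconfiguration path: when a node fails, drop it from every replica
        // set and promote a surviving replica to primary, so the application
        // keeps running without waiting for the failed node to recover.
        void fail_node(std::map<int, Page>& pages, int node) {
            alive_[node] = false;
            for (auto& entry : pages) {
                Page& p = entry.second;
                p.replicas.erase(node);
                if (p.primary == node && !p.replicas.empty())
                    p.primary = *p.replicas.begin();
            }
        }

    private:
        std::vector<bool> alive_;
    };

    int main() {
        Cluster cluster(3);
        std::map<int, Page> pages;
        pages[0] = Page{"hello", /*primary=*/0, {0, 1}};  // copies on nodes 0 and 1

        cluster.write(pages, 0, "updated");  // normal execution
        cluster.fail_node(pages, 0);         // node 0 fails

        // Page 0 is still available from node 1; no full recovery of node 0 needed.
        std::printf("page 0 served by node %d: %s\n",
                    pages[0].primary, pages[0].data.c_str());
        return 0;
    }

The property the sketch aims to capture is that, with replicas already resident in other nodes' memories, fail-over reduces to redirecting accesses to a surviving copy rather than restoring state on the failed node, which is what makes very fast reconfiguration plausible.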