Software-Based Replication for Fault Tolerance

Authors:
Rachid Guerraoui;André Schiper
Venue:
Computer
Year:
1997

Quantified Score

Hi-index	4.10

Abstract

Developers of early distributed systems took a simplistic approach to providing fault tolerance: They just used another copy of the same hardware as a backup. Later, others developed replication software to work on off-the-shelf hardware. Since neither of these methods is especially economical, a logical course is to take it one step further and eliminate the extra hardware altogether. Fully software-based replication relies on sophisticated techniques to keep track of server communications and ensure the consistency of information across several server replicas. How do you know that each server shares the same view of the data or program semantics? What happens if a server replica crashes? How do you make sure that a system processes invocations in the correct order? These are all problems that a replication technique has to handle. The authors describe two fundamental techniques, primary-backup and active replication, and illustrate how they handle these problems. At this point, both have advantages and disadvantages that depend on the application. The authors also propose that group communication provides a sufficient framework for implementing software-based replication. The concept of static and dynamic groups proves useful in thinking about how to implement replication techniques. Replication techniques can also use total-order and view-synchronous multicast primitives from group communication.
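
The two techniques named in the abstract can be contrasted with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Replica` counter service, the `ActiveGroup` and `PrimaryBackupGroup` classes, and the `alive` flag used to simulate a crash are all hypothetical names introduced here. Everything runs in one process, so the ordered log stands in for a total-order multicast and the state-copy loop stands in for a view-synchronous update.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A deterministic in-memory counter server (hypothetical example service)."""
    name: str
    value: int = 0
    alive: bool = True  # set to False to simulate a crash

    def apply(self, op: str, amount: int) -> int:
        if op == "add":
            self.value += amount
        elif op == "mul":
            self.value *= amount
        else:
            raise ValueError(f"unknown operation: {op}")
        return self.value


class ActiveGroup:
    """Active replication: every live replica executes every invocation,
    and all of them see the invocations in the same order."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []  # assumption: stands in for total-order multicast

    def invoke(self, op, amount):
        self.log.append((op, amount))
        results = [r.apply(op, amount) for r in self.replicas if r.alive]
        return results[0]  # replicas are deterministic, so all replies agree


class PrimaryBackupGroup:
    """Primary-backup: only the primary executes; backups receive its state."""

    def __init__(self, replicas):
        self.replicas = replicas

    def primary(self):
        # The first live replica acts as primary; a backup takes over on a crash.
        return next(r for r in self.replicas if r.alive)

    def invoke(self, op, amount):
        primary = self.primary()
        result = primary.apply(op, amount)
        for r in self.replicas:
            if r.alive and r is not primary:
                r.value = primary.value  # assumption: stands in for a state update
        return result


# Active replication: crash one replica, the rest keep answering.
active = ActiveGroup([Replica("a1"), Replica("a2"), Replica("a3")])
active.invoke("add", 5)
active.replicas[0].alive = False
active_result = active.invoke("mul", 3)

# Primary-backup: crash the primary, the backup takes over with the same state.
pb = PrimaryBackupGroup([Replica("p1"), Replica("p2")])
pb.invoke("add", 5)
pb.replicas[0].alive = False
pb_result = pb.invoke("mul", 3)

print(active_result, pb_result)
```

In the active group the order of `add` and `mul` matters, which is why every replica must process invocations in the same sequence. In the primary-backup group ordering is decided by the primary alone, but the backup can only take over correctly if it received the last state update before the crash.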