Implementing fault-tolerant services using the state machine approach: a tutorial

Authors:
Fred B. Schneider
Affiliations:
Cornell Univ., Ithaca, NY
Venue:
ACM Computing Surveys (CSUR)
Year:
1990

Citing 23
Cited 356

Fault-tolerant broadcasts

Science of Computer Programming
Using Time Instead of Timeout for Fault-Tolerant Distributed Systems.

ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed systems: methods and tools for specification. An advanced course

Distributed systems: methods and tools for specification. An advanced course
Applications of Byzantine agreement in database systems

ACM Transactions on Database Systems (TODS)
Reliable communication in the presence of failures

ACM Transactions on Computer Systems (TOCS)
Highly available distributed services and fault-tolerant distributed garbage collection

PODC '86 Proceedings of the fifth annual ACM symposium on Principles of distributed computing
Design of the x-kernel

SIGCOMM '88 Symposium proceedings on Communications architectures and protocols
Towards a theory of replicated processing

Proceedings of a Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems
Reliable scheduling in a TMR database system

ACM Transactions on Computer Systems (TOCS)
Preserving and using context information in interprocess communication

ACM Transactions on Computer Systems (TOCS)
Early-delivery atomic broadcast

PODC '90 Proceedings of the ninth annual ACM symposium on Principles of distributed computing
Designing distributed services using refinement mappings

Designing distributed services using refinement mappings
Impossibility of distributed consensus with one faulty process

Journal of the ACM (JACM)
Replication and fault-tolerance in the ISIS system

Proceedings of the tenth ACM symposium on Operating systems principles
Synchronization in Distributed Programs

ACM Transactions on Programming Languages and Systems (TOPLAS)
The Byzantine Generals Problem

ACM Transactions on Programming Languages and Systems (TOPLAS)
Fail-stop processors: an approach to designing fault-tolerant computing systems

ACM Transactions on Computer Systems (TOCS)
Byzantine generals in action: implementing fail-stop processors

ACM Transactions on Computer Systems (TOCS)
Time, clocks, and the ordering of events in a distributed system

Communications of the ACM
Self-stabilizing systems in spite of distributed control

Communications of the ACM
Notes on Data Base Operating Systems

Operating Systems, An Advanced Course
Byzantine clock synchronization

PODC '84 Proceedings of the third annual ACM symposium on Principles of distributed computing
Fault-tolerant clock synchronization

PODC '84 Proceedings of the third annual ACM symposium on Principles of distributed computing

Early-delivery atomic broadcast

PODC '90 Proceedings of the ninth annual ACM symposium on Principles of distributed computing
Principal Features of the VOLTAN Family of Reliable Node Architectures for Distributed Systems

IEEE Transactions on Computers - Special issue on fault-tolerant computing
The consensus problem in fault-tolerant computing

ACM Computing Surveys (CSUR)
Causal controversy at Le Mont St.-Michel

ACM SIGOPS Operating Systems Review
High availability in a real-time system

ACM SIGOPS Operating Systems Review
The process group approach to reliable distributed computing

Communications of the ACM
Unifying self-stabilization and fault-tolerance

PODC '93 Proceedings of the twelfth annual ACM symposium on Principles of distributed computing
TTP-A Protocol for Fault-Tolerant Real-Time Systems

Computer
How to securely replicate services

ACM Transactions on Programming Languages and Systems (TOPLAS)
Secure agreement protocols: reliable and atomic group multicast in rampart

CCS '94 Proceedings of the 2nd ACM Conference on Computer and communications security
A security architecture for fault-tolerant systems

ACM Transactions on Computer Systems (TOCS) - Special issue on computer architecture
Supporting Fault-Tolerant Parallel Programming in Linda

IEEE Transactions on Parallel and Distributed Systems
Programming Language Support for Writing Fault-Tolerant Distributed Software

IEEE Transactions on Computers - Special issue on fault-tolerant computing
Hypervisor-based fault tolerance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
A highly available scalable ITV system

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Hypervisor-based fault tolerance

ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
Unreliable failure detectors for reliable distributed systems

Journal of the ACM (JACM)
Distributing trust with the Rampart toolkit

Communications of the ACM
From group communication to transactions in distributed systems

Communications of the ACM
A Secure Group Membership Protocol

IEEE Transactions on Software Engineering
The Ω key management service

CCS '96 Proceedings of the 3rd ACM conference on Computer and communications security
Implementing Fail-Silent Nodes for Distributed Systems

IEEE Transactions on Computers
Efficient message ordering in dynamic networks

PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Comparing primary-backup and state machines for crash failures

PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Specifying and using a partitionable group communication service

PODC '97 Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing
Probabilistic quorum systems

PODC '97 Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing
Design and Evaluation of a Window-Consistent Replication Service

IEEE Transactions on Computers
Path independence for authentication in large-scale systems

Proceedings of the 4th ACM conference on Computer and communications security
Cloning: a novel method for interactive parallel simulation

Proceedings of the 29th conference on Winter simulation
Fault tolerance in distributed Ada 95

IRTAW '97 Proceedings of the eighth international workshop on Real-Time Ada
Synthesis of fault-tolerant concurrent programs

PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
Dynamic virtual logical processes

PADS '98 Proceedings of the twelfth workshop on Parallel and distributed simulation
Fault-tolerant wait-free shared objects

Journal of the ACM (JACM)
The part-time parliament

ACM Transactions on Computer Systems (TOCS)
Designing Masking Fault-Tolerance via Nonmasking Fault-Tolerance

IEEE Transactions on Software Engineering
Multi-μ: an Ada 95 based architecture for fault tolerance support of real-time systems

Proceedings of the 1998 annual ACM SIGAda international conference on Ada
Coyote: a system for constructing fine-grain configurable communication services

ACM Transactions on Computer Systems (TOCS)
An evaluation of flow control in group communication

IEEE/ACM Transactions on Networking (TON)
Practical Byzantine fault tolerance

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Client-Access Protocols for Replicated Services

IEEE Transactions on Software Engineering
Resilient Authentication Using Path Independence

IEEE Transactions on Computers
A Real-Time Primary-Backup Replication Service

IEEE Transactions on Parallel and Distributed Systems
Fundamentals of fault-tolerant distributed computing in asynchronous environments

ACM Computing Surveys (CSUR)
Replicated invocations in wide-area systems

Proceedings of the 8th ACM SIGOPS European workshop on Support for composing distributed applications
Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems

IEEE Transactions on Computers
An architecture for distributed OASIS services

IFIP/ACM International Conference on Distributed systems platforms
Efficient atomic broadcast using deterministic merge

Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing
Specifying and using a partitionable group communication service

ACM Transactions on Computer Systems (TOCS)
Consensus-based fault-tolerant total order multicast

IEEE Transactions on Parallel and Distributed Systems
Lamport on mutual exclusion: 27 years of planting seeds

Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
BASE: using abstraction to improve fault tolerance

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Group communication specifications: a comprehensive study

ACM Computing Surveys (CSUR)
On the cost of fault-tolerant consensus when there are no faults: preliminary version

ACM SIGACT News
High availability in a real-time system

EW 5 Proceedings of the 5th workshop on ACM SIGOPS European workshop: Models and paradigms for distributed systems structuring
Survival by defense-enabling

Proceedings of the 2001 workshop on New security paradigms
Cloning parallel simulations

ACM Transactions on Modeling and Computer Simulation (TOMACS)
Design and evaluation of a conit-based continuous consistency model for replicated services

ACM Transactions on Computer Systems (TOCS)
Practical byzantine fault tolerance and proactive recovery

ACM Transactions on Computer Systems (TOCS)
Active disk paxos with infinitely many processes

Proceedings of the twenty-first annual symposium on Principles of distributed computing
Run-time adaptation in river

ACM Transactions on Computer Systems (TOCS)
Auditing Causal Relationships of Group Multicast Communications in Group-Oriented Distributed Systems

The Journal of Supercomputing
Distributed Fault Tolerance: Lessons from Delta-4

IEEE Micro
An Architecture for Survivable Coordination in Large Distributed Systems

IEEE Transactions on Knowledge and Data Engineering
The Cost of Recovery in Message Logging Protocols

IEEE Transactions on Knowledge and Data Engineering
Consensus-Based Fault-Tolerant Total Order Multicast

IEEE Transactions on Parallel and Distributed Systems
On Group Communication Support in CORBA

IEEE Transactions on Parallel and Distributed Systems
Structuring Fault-Tolerant Object Systems for Modularity in a Distributed Environment

IEEE Transactions on Parallel and Distributed Systems
Specifying and Verifying Requirements of Real-Time Systems

IEEE Transactions on Software Engineering
The Database State Machine Approach

Distributed and Parallel Databases
Design and Verification of Distributed Recovery Blocks with CSP

Formal Methods in System Design
Exception handling and resolution for transactional object groups

Advances in exception handling techniques
Addressing Scalability Issues Using the CLF Middleware

EDOC '01 Proceedings of the 5th IEEE International Conference on Enterprise Distributed Object Computing
Online Non-stop Software Update Using Replicated Execution Blocks

COMPSAC '00 24th International Computer Software and Applications Conference
Disk Paxos

DISC '00 Proceedings of the 14th International Conference on Distributed Computing
Scalable Secure Storage when Half the System Is Faulty

ICALP '00 Proceedings of the 27th International Colloquium on Automata, Languages and Programming
Fault Tolerance by Transparent Replication for Distributed Ada 95

Ada-Europe '99 Proceedings of the 1999 Ada-Europe International Conference on Reliable Software Technologies
How to Modify the GNAT Frontend to Experiment with Ada Extensions

Ada-Europe '99 Proceedings of the 1999 Ada-Europe International Conference on Reliable Software Technologies
Building Robust Applications by Reusing Non-robust Legacy Software

Ada-Europe '01 Proceedings of the 6th Ada-Europe International Conference Leuven on Reliable Software Technologies
Transparent Environment for Replicated Ravenscar Applications

Ada-Europe '02 Proceedings of the 7th Ada-Europe International Conference on Reliable Software Technologies
A Tailorable Distributed Programming Environment

Ada-Europe '02 Proceedings of the 7th Ada-Europe International Conference on Reliable Software Technologies
Building TMR-Based Reliable Servers Despite Bounded Input Lifetimes

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
MetaJava - A Platform for Adaptable Operating-System Mechanisms

ECOOP '97 Proceedings of the Workshops on Object-Oriented Technology
Bus Architectures for Safety-Critical Embedded Systems

EMSOFT '01 Proceedings of the First International Workshop on Embedded Software
An Overview of Formal Verification for the Time-Triggered Architecture

FTRTFT '02 Proceedings of the 7th International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems: Co-sponsored by IFIP WG 2.2
Agreement Problems in Fault-Tolerant Distributed Systems

SOFSEM '01 Proceedings of the 28th Conference on Current Trends in Theory and Practice of Informatics Piestany: Theory and Practice of Informatics
Broadening the Scope of Fault Tolerance within Secure Services

Revised Papers from the 8th International Workshop on Security Protocols
Exception Handling and Resolution for Transactional Object Groups

Advances in Exception Handling Techniques (the book grew out of an ECOOP 2000 workshop)
Topology-Aware Algorithms for Large-Scale Communication

Advances in Distributed Systems, Advanced Distributed Computing: From Algorithms to Systems
Integrating Group Communication with Transactions for Implementing Persistent Replicated Objects

Advances in Distributed Systems, Advanced Distributed Computing: From Algorithms to Systems
Programming Partition-Aware Network Applications

Advances in Distributed Systems, Advanced Distributed Computing: From Algorithms to Systems
Improving Scalability of Replicated Services in Mobile Agent Systems

MA '02 Proceedings of the 6th International Conference on Mobile Agents
Middleware Support for Voting and Data Fusion

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Distributing Trust on the Internet

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
The Design and Use of Persistent Memory on the DNCP Hardware Fault-Tolerant Platform

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Byzantine Fault Tolerance Can Be Fast

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
A Secure and Highly Available Distributed Store for Meeting Diverse Data Storage Needs

DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Generic Broadcast

Proceedings of the 13th International Symposium on Distributed Computing
Atomic Data Access in Distributed Hash Tables

IPTPS '01 Revised Papers from the First International Workshop on Peer-to-Peer Systems
The Bancomat problem: an example of resource allocation in a partitionable asynchronous system

Theoretical Computer Science - Special issue: Distributed computing
Reconfiguration and transient recovery in state machine architectures

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault-Tolerance: Java's Missing Buzzword

HCW '98 Proceedings of the Seventh Heterogeneous Computing Workshop
Using Replication and Partitioning to Build Secure Distributed Systems

SP '03 Proceedings of the 2003 IEEE Symposium on Security and Privacy
A Method for Combining Replication with Caching

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Responsive Security for Stored Data

ICDCS '03 Proceedings of the 23rd International Conference on Distributed Computing Systems
A Replication Technique Based on a Functional and Attribute Grammar Computation Model

ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering
BASE: Using abstraction to improve fault tolerance

ACM Transactions on Computer Systems (TOCS)
The ELEKTRA Railway Signalling-System: Field Experience with an Actively Replicated System with Diversity

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Fault Tolerance in Safety Critical Automotive Applications: Cost of Agreement as a Limiting Factor

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Backoff Protocols for Distributed Mutual Exclusion and Ordering

ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
An algorithm for Supporting Fault Tolerant Objects in Distributed Object-Oriented Operating Systems

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Filtering Duplicated Invocations Using Symmetric Proxies

IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Separating agreement from execution for byzantine fault tolerant services

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A Timeout-Based Message Ordering Protocol for a Lightweight Software Implementation of TMR Systems

IEEE Transactions on Parallel and Distributed Systems
Synthesis of fault-tolerant concurrent programs

ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed communication in ML

Journal of Functional Programming
Replication Management in Reliable Real-Time Systems

Real-Time Systems
A weakest failure detector-based asynchronous consensus protocol for f

Information Processing Letters
An analysis of update ordering in distributed replication systems

Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Highly available, fault-tolerant, parallel dataflows

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Reliable Distributed Network Management by Replication

Journal of Network and Systems Management
The weakest failure detectors to solve certain fundamental problems in distributed computing

Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing
Implementing a replicated service with group communication

Journal of Systems Architecture: the EUROMICRO Journal
Replication for web hosting systems

ACM Computing Surveys (CSUR)
Handling message semantics with Generic Broadcast protocols

Distributed Computing
Replication algorithms for the World-Wide Web

Journal of Systems Architecture: the EUROMICRO Journal
Total order broadcast and multicast algorithms: Taxonomy and survey

ACM Computing Surveys (CSUR)
The Guardian Model and Primitives for Exception Handling in Distributed Systems

IEEE Transactions on Software Engineering
Consistent and automatic replica regeneration

ACM Transactions on Storage (TOS)
Comparison of Database Replication Techniques Based on Total Order Broadcast

IEEE Transactions on Knowledge and Data Engineering
Geographically Distributed System for Catastrophic Recovery

LISA '02 Proceedings of the 16th USENIX conference on System administration
Simple and Efficient Oracle-Based Consensus Protocols for Asynchronous Byzantine Systems

IEEE Transactions on Dependable and Secure Computing
Disk Paxos

Distributed Computing
Architectural support for mode-driven fault tolerance in distributed applications

WADS '05 Proceedings of the 2005 workshop on Architecting dependable systems
Plutus: Scalable Secure File Sharing on Untrusted Storage

FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
Implementing Trustworthy Services Using Replicated State Machines

IEEE Security and Privacy
BAR fault tolerance for cooperative services

Proceedings of the twentieth ACM symposium on Operating systems principles
Fault-scalable Byzantine fault-tolerant services

Proceedings of the twentieth ACM symposium on Operating systems principles
IRON file systems

Proceedings of the twentieth ACM symposium on Operating systems principles
FTWeb: A Fault Tolerant Infrastructure for Web Services

EDOC '05 Proceedings of the Ninth IEEE International EDOC Enterprise Computing Conference
Dynamic data replication and consistency in mobile environments

DSM '05 Proceedings of the 2nd international doctoral symposium on Middleware
From Set Membership to Group Membership: A Separation of Concerns

IEEE Transactions on Dependable and Secure Computing
Active Replication of Multithreaded Applications

IEEE Transactions on Parallel and Distributed Systems
Trust but verify: accountability for network services

Proceedings of the 11th workshop on ACM SIGOPS European workshop
WS-replication: a framework for highly available web services

Proceedings of the 15th international conference on World Wide Web
BTS: a Byzantine fault-tolerant tuple space

Proceedings of the 2006 ACM symposium on Applied computing
Active disk Paxos with infinitely many processes

Distributed Computing - Special issue: PODC 02
MobiEyes: A Distributed Location Monitoring Service Using Moving Location Queries

IEEE Transactions on Mobile Computing
Fast Byzantine Consensus

IEEE Transactions on Dependable and Secure Computing
The SMART way to migrate replicated stateful services

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Tashkent: uniting durability with transaction ordering for high-performance scalable database replication

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Behaviour Abstraction for Communicating Sequential Processes

Fundamenta Informaticae
Specifying and using intrusion masking models to process distributed operations

Journal of Computer Security
Design and implementation of a secure wide-area object middleware

Computer Networks: The International Journal of Computer and Telecommunications Networking
Tight bounds for asynchronous randomized consensus

Proceedings of the thirty-ninth annual ACM symposium on Theory of computing
Static analysis meets distributed fault-tolerance: enabling state-machine replication with nondeterminism

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
The case for Byzantine fault detection

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
The phoenix recovery system: rebuilding from the ashes of an internet catastrophe

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Secure data replication over untrusted hosts

HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Consistent and automatic replica regeneration

NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Proactive recovery in a Byzantine-fault-tolerant system

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Chain replication for supporting high throughput and availability

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
A Parsimonious Approach for Obtaining Resource-Efficient and Trustworthy Execution

IEEE Transactions on Dependable and Secure Computing
Unified support for heterogeneous security policies in distributed systems

SSYM'98 Proceedings of the 7th conference on USENIX Security Symposium - Volume 7
Implementing causal logging using OrbixWeb interception

COOTS'99 Proceedings of the 5th conference on USENIX Conference on Object-Oriented Technologies & Systems - Volume 5
Filterfresh: hot replication of java RMI server objects

COOTS'98 Proceedings of the 4th conference on USENIX Conference on Object-Oriented Technologies and Systems - Volume 4
Tcl-DP name server

TCLTK '98 Proceedings of the 3rd Annual USENIX Workshop on Tcl/Tk - Volume 3
Asynchronous Agreement and Its Relation with Error-Correcting Codes

IEEE Transactions on Computers
Environmentally responsible middleware: an altruistic behavior model for distributed middleware components

Proceedings of the 16th international symposium on High performance distributed computing
Tashkent+: memory-aware load balancing and update filtering in replicated databases

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Paxos made live: an engineering perspective

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
The coBFIT toolkit

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Strong accountability for network storage

ACM Transactions on Storage (TOS)
Zyzzyva: speculative byzantine fault tolerance

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
PeerReview: practical accountability for distributed systems

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Attested append-only memory: making adversaries stick to their word

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
HQ replication: a hybrid quorum protocol for byzantine fault tolerance

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Design of a cheat-resistant P2P online gaming system

Proceedings of the 2nd international conference on Digital interactive media in entertainment and arts
Exploiting type-awareness in a self-recovering disk

Proceedings of the 2007 ACM workshop on Storage security and survivability
Flexible intrusion tolerant voting architecture

Proceedings of the 2007 ACM workshop on Scalable trusted computing
Pronto: High availability for standard off-the-shelf databases

Journal of Parallel and Distributed Computing
A survey of linguistic structures for application-level fault tolerance

ACM Computing Surveys (CSUR)
DepSpace: a byzantine fault-tolerant coordination service

Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
Optimistic transactional active replication

Proceedings of the 2nd international conference on Ubiquitous information management and communication
Fingerpointing correlated failures in replicated systems

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
Conflict-aware load-balancing techniques for database replication

Proceedings of the 2008 ACM symposium on Applied computing
Data and code integrity in Grid environments

SMO'06 Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization
Got predictability?: experiences with fault-tolerant middleware

Proceedings of the 2007 ACM/IFIP/USENIX international conference on Middleware companion
Replica placement for high availability in distributed stream processing systems

Proceedings of the second international conference on Distributed event-based systems
Nysiad: practical protocol transformation to tolerate Byzantine failures

NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Zyzzyva: speculative Byzantine fault tolerance

Communications of the ACM - Remembering Jim Gray
Virtual infrastructure for collision-prone wireless networks

Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing
Randomized consensus in expected O(n log n) individual work

Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing
Research note: On Byzantine generals with alternative plans

Journal of Parallel and Distributed Computing
Tight bounds for asynchronous randomized consensus

Journal of the ACM (JACM)
Preserving the consistency of distributed objects with real-time transactions

NOTERE '08 Proceedings of the 8th international conference on New technologies in distributed systems
Handling Emergent Nondeterminism in Replicated Services

Architecting Dependable Systems V
Programming with Live Distributed Objects

ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Optimizing Threshold Protocols in Adversarial Structures

DISC '08 Proceedings of the 22nd international symposium on Distributed Computing
Showing correctness of a replication algorithm in a component based system

IDEAS '08 Proceedings of the 2008 international symposium on Database engineering & applications
Fault-tolerant stream processing using a distributed, replicated file system

Proceedings of the VLDB Endowment
Experiences in engineering active replication into a traditional three-tiered client-server system

Proceedings of the 2008 RISE/EFTS Joint International Workshop on Software Engineering for Resilient Systems
Solving Atomic Multicast When Groups Crash

OPODIS '08 Proceedings of the 12th International Conference on Principles of Distributed Systems
Reliability versus performance for critical applications

Journal of Parallel and Distributed Computing
Living with nondeterminism in replicated middleware applications

Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware
A simple totally ordered broadcast protocol

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Paxos for System Builders: an overview

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Configuration-space performance anomaly depiction

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Reducing the costs of large-scale BFT replication

LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware
Design and implementation of a Byzantine fault tolerance framework for Web services

Journal of Systems and Software
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Tolerating latency in replicated state machines through client speculation

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
A Generic Group Communication Approach for Hybrid Distributed Systems

DAIS '09 Proceedings of the 9th IFIP WG 6.1 International Conference on Distributed Applications and Interoperable Systems
Dynamic atomic storage without consensus

Proceedings of the 28th ACM symposium on Principles of distributed computing
Building reliable large-scale distributed systems: when theory meets practice

ACM SIGACT News
Upright cluster services

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Symmetric active/active metadata service for high availability parallel file systems

Journal of Parallel and Distributed Computing
Ripley: automatically securing web 2.0 applications through replicated execution

Proceedings of the 16th ACM conference on Computer and communications security
Zyzzyva: Speculative Byzantine fault tolerance

ACM Transactions on Computer Systems (TOCS)
A Decidable Probability Logic for Timed Probabilistic Systems

Fundamenta Informaticae
The Design of Finite State Machine for Asynchronous Replication Protocol

ICIC '07 Proceedings of the 3rd International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence
Weak Synchrony Models and Failure Detectors for Message Passing (k-)Set Agreement

OPODIS '09 Proceedings of the 13th International Conference on Principles of Distributed Systems
Proactive Fortification of Fault-Tolerant Services

OPODIS '09 Proceedings of the 13th International Conference on Principles of Distributed Systems
A novel approach for component-based fault-tolerant software development

Information and Software Technology
The reliability analysis of resiliency framework for Grid Services

ACST '08 Proceedings of the Fourth IASTED International Conference on Advances in Computer Science and Technology
Predicting and preventing inconsistencies in deployed distributed systems

ACM Transactions on Computer Systems (TOCS)
Semi-passive replication and Lazy Consensus

Journal of Parallel and Distributed Computing
Reconfiguring a state machine

ACM SIGACT News
Policy-based access control for weakly consistent replication

Proceedings of the 5th European conference on Computer systems
A pattern-based approach for modeling and analyzing error recovery

Architecting dependable systems IV
A scalable and secure cryptographic service

Proceedings of the 21st annual IFIP WG 11.3 working conference on Data and applications security
Byzantine consensus with few synchronous links

OPODIS'07 Proceedings of the 11th international conference on Principles of distributed systems
Fault tolerance in finite state machines using fusion

ICDCN'08 Proceedings of the 9th international conference on Distributed computing and networking
Lithium: virtual machine storage for the cloud

Proceedings of the 1st ACM symposium on Cloud computing
Towards a practical approach to confidential Byzantine fault tolerance

Future directions in distributed computing
A data-centric approach for scalable state machine replication

Future directions in distributed computing
Best-effort group service in dynamic networks

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Proactive obfuscation

ACM Transactions on Computer Systems (TOCS)
Throughput optimal total order broadcast for cluster environments

ACM Transactions on Computer Systems (TOCS)
Enabling replication in the ASSISTANT programming model

Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Scalable byzantine computation

ACM SIGACT News
The byzantine empire in the intercloud

ACM SIGACT News
Prophecy: using history for high-throughput fault tolerance

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Mencius: building efficient replicated state machines for WANs

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
ZooKeeper: wait-free coordination for internet-scale systems

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
The failure detector abstraction

ACM Computing Surveys (CSUR)
Implementing fault-tolerant services using state machines: beyond replication

DISC'10 Proceedings of the 24th international conference on Distributed computing
Programming distributed systems with group IO

EUROMICRO-PDP'02 Proceedings of the 10th Euromicro conference on Parallel, distributed and network-based processing
The design of a practical system for fault-tolerant virtual machines

ACM SIGOPS Operating Systems Review
Scalable virtual machine storage using local disks

ACM SIGOPS Operating Systems Review
The case for determinism in database systems

Proceedings of the VLDB Endowment
Declarative configuration management for complex and dynamic networks

Proceedings of the 6th International COnference
Storyboard: optimistic deterministic multithreading

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Active quorum systems

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
A rising tide lifts all boats: how memory error prediction and prevention can help with virtualized system longevity

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Deterministic process groups in dOS

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Efficient system-enforced deterministic parallelism

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scalable transactions in the cloud: partitioning revisited

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems: Part II
Synoptic: summarizing system logs with refinement

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Putting events in context: aspects for event-based distributed programming

Proceedings of the tenth international conference on Aspect-oriented software development
DieCast: Testing Distributed Systems with an Accurate Scale Model

ACM Transactions on Computer Systems (TOCS)
Paxos replicated state machines as the basis of a high-performance data store

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Plutus: scalable secure file sharing on untrusted storage

FAST'03 Proceedings of the 2nd USENIX conference on File and storage technologies
The role of accountability in dependable distributed systems

HotDep'05 Proceedings of the First conference on Hot topics in system dependability
Managing self-inflicted nondeterminism

HotDep'05 Proceedings of the First conference on Hot topics in system dependability
Static analysis meets distributed fault-tolerance: enabling state-machine replication with nondeterminism

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
The case for byzantine fault detection

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
Beyond one-third faulty replicas in byzantine fault tolerant systems

NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Distributed and fault-tolerant execution framework for transaction processing

Proceedings of the 4th Annual International Conference on Systems and Storage
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A layered approach for identifying systematic faults of component-based software systems

Proceedings of the 16th international workshop on Component-oriented programming
Multi-writer regular registers in dynamic distributed systems with byzantine failures

Proceedings of the 3rd International Workshop on Theoretical Aspects of Dynamic Distributed Systems
Scalable consistency in Scatter

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Detecting and surviving data races using complementary schedules

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
An algorithm for implementing BFT registers in distributed systems with bounded churn

SSS'11 Proceedings of the 13th international conference on Stabilization, safety, and security of distributed systems
Evaluating the viability of process replication reliability for exascale systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Living with nondeterminism in replicated middleware applications

Middleware'06 Proceedings of the 7th ACM/IFIP/USENIX international conference on Middleware
Extending the UMIOP specification for reliable multicast in CORBA

OTM'05 Proceedings of the 2005 Confederated international conference on On the Move to Meaningful Internet Systems - Volume Part I
Integrating the ROMIOP and ETF specifications for atomic multicast in CORBA

OTM'05 Proceedings of the 2005 Confederated international conference on On the Move to Meaningful Internet Systems - Volume Part I
Group communication: from practice to theory

SOFSEM'06 Proceedings of the 32nd conference on Current Trends in Theory and Practice of Computer Science
Run-time switching between total order algorithms

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Faults in large distributed systems and what we can do about them

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Replication predicates for dependent-failure algorithms

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Behavioral distance for intrusion detection

RAID'05 Proceedings of the 8th international conference on Recent Advances in Intrusion Detection
Commensal cuckoo: secure group partitioning for large-scale services

ACM SIGOPS Operating Systems Review
From paxos to CORFU: a flash-speed shared log

ACM SIGOPS Operating Systems Review
Whole-system persistence

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
A formal model for fault-tolerance in distributed systems

SAFECOMP'05 Proceedings of the 24th international conference on Computer Safety, Reliability, and Security
TrustedPals: secure multiparty computation implemented with smart cards

ESORICS'06 Proceedings of the 11th European conference on Research in Computer Security
A fault tolerant system using collaborative agents

TAINN'05 Proceedings of the 14th Turkish conference on Artificial Intelligence and Neural Networks
Parsimonious asynchronous byzantine-fault-tolerant atomic broadcast

OPODIS'05 Proceedings of the 9th international conference on Principles of Distributed Systems
Behavioral distance measurement using hidden markov models

RAID'06 Proceedings of the 9th international conference on Recent Advances in Intrusion Detection
Architecting and implementing versatile dependability

Architecting Dependable Systems III
The lost art of abstraction

Architecting Dependable Systems III
Dependable systems

Dependable Systems
Improving server applications with system transactions

Proceedings of the 7th ACM european conference on Computer Systems
Stumbling over consensus research: misunderstandings and issues

Replication
Replicating for performance: case studies

Replication
A history of the virtual synchrony replication model

Replication
From viewstamped replication to byzantine fault tolerance

Replication
Implementing trustworthy services using replicated state machines

Replication
State machine replication with byzantine faults

Replication
Toward survivable SCADA

Proceedings of the Seventh Annual Workshop on Cyber Security and Information Intelligence Research
Fused state machines for fault tolerance in distributed systems

OPODIS'11 Proceedings of the 15th international conference on Principles of Distributed Systems
Byzantine fault-tolerance with commutative commands

OPODIS'11 Proceedings of the 15th international conference on Principles of Distributed Systems
A protocol for the atomic capture of multiple molecules on large scale platforms

ICDCN'12 Proceedings of the 13th international conference on Distributed Computing and Networking
Byzantine agreement with homonyms in synchronous systems

ICDCN'12 Proceedings of the 13th international conference on Distributed Computing and Networking
Beyond traces and independence

Dependable and Historic Computing
RESTGroups for resilient web services

SOFSEM'12 Proceedings of the 38th international conference on Current Trends in Theory and Practice of Computer Science
CORFU: a shared log design for flash clusters

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Don't lose sleep over availability: the GreenUp decentralized wakeup service

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Gnothi: separating data and metadata for efficient and available storage replication

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Surviving congestion in geo-distributed storage systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Scalability of replicated metadata services in distributed file systems

DAIS'12 Proceedings of the 12th IFIP WG 6.1 international conference on Distributed Applications and Interoperable Systems
Homonyms with forgeable identifiers

SIROCCO'12 Proceedings of the 19th international conference on Structural Information and Communication Complexity
Pushouts in software architecture design

Proceedings of the 11th International Conference on Generative Programming and Component Engineering
Behaviour Abstraction for Communicating Sequential Processes

Fundamenta Informaticae
All about Eve: execute-verify replication for multi-core servers

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Making geo-replicated systems fast as possible, consistent when necessary

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
DMME: A Distributed LTE Mobility Management Entity

Bell Labs Technical Journal
Probabilistic opaque quorum systems

DISC'07 Proceedings of the 21st international conference on Distributed Computing
Formal verification of distributed algorithms: from pseudo code to checked proofs

TCS'12 Proceedings of the 7th IFIP TC 1/WG 202 international conference on Theoretical Computer Science
Replication for linked data

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II
Adaptive request batching for byzantine replication

ACM SIGOPS Operating Systems Review
Abstracting context in event-based software

Transactions on Aspect-Oriented Software Development IX
Enhancing group communication with self-manageable behavior

Journal of Parallel and Distributed Computing
A study of unpredictability in fault-tolerant middleware

Computer Networks: The International Journal of Computer and Telecommunications Networking
Churn Tolerance Algorithm for State Machine Replication

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 02
Photon: fault-tolerant and scalable joining of continuous data streams

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
MoSQL: an elastic storage engine for MySQL

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Efficient software-based fault tolerance approach on multicore platforms

Proceedings of the Conference on Design, Automation and Test in Europe
Avoiding disruptive failovers in transaction processing systems with multiple active nodes

Journal of Parallel and Distributed Computing
Rollback-recovery without checkpoints in distributed event processing systems

Proceedings of the 7th ACM international conference on Distributed event-based systems
Escape capsule: explicit state is robust and scalable

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Towards secure and dependable software-defined networks

Proceedings of the second ACM SIGCOMM workshop on Hot topics in software defined networking
Distributing trusted third parties

ACM SIGACT News
Cooperative security in distributed networks

Computer Communications
Towards practical communication in Byzantine-resistant DHTs

IEEE/ACM Transactions on Networking (TON)
Adaptive atomic capture of multiple molecules

Journal of Parallel and Distributed Computing
The TClouds platform: concept, architecture and instantiations

Proceedings of the 2nd International Workshop on Dependability Issues in Cloud Computing
Assessing data availability of Cassandra in the presence of non-accurate membership

Proceedings of the 2nd International Workshop on Dependability Issues in Cloud Computing
Byzantine agreement with homonyms in synchronous systems

Theoretical Computer Science
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
On the use of decentralization to enable privacy in web-scale recommendation services

Proceedings of the 12th ACM workshop on privacy in the electronic society
Tango: distributed data structures over a shared log

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Leveraging sharding in the design of scalable replication protocols

Proceedings of the 4th annual Symposium on Cloud Computing
COLO: COarse-grained LOck-stepping virtual machines for non-stop service

Proceedings of the 4th annual Symposium on Cloud Computing
Consistency without borders

Proceedings of the 4th annual Symposium on Cloud Computing
Optimizing Paxos with request exchangeability for highly available web services

Proceedings of the 5th Asia-Pacific Symposium on Internetware
On the efficiency of durable state machine replication

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
CORFU: A distributed shared log

ACM Transactions on Computer Systems (TOCS)
On the performance of a retransmission-based synchronizer

Theoretical Computer Science
A protocol for implementing byzantine storage in churn-prone distributed systems

Theoretical Computer Science
Scalable service-oriented replication with flexible consistency guarantee in the cloud

Information Sciences: an International Journal
A fault tolerant platform of web services based on service composition

Multiagent and Grid Systems
Scalable and leaderless Byzantine consensus in cloud computing environments

Information Systems Frontiers

Quantified Score

Hi-index	0.07

Abstract

The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models: Byzantine and fail-stop. System reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.
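
The core idea summarized in the abstract can be illustrated with a small sketch. It is not taken from the paper: the class names (`CounterMachine`, `FaultyMachine`), the command format, and the helpers `vote` and `run` are all hypothetical, and the sketch assumes every replica already receives the same commands in the same order (the agreement and ordering protocols the paper describes are not modeled). It assumes the Byzantine setting, where 2t+1 replicas are used to tolerate t faulty ones and a client accepts an output reported by at least t+1 replicas.

```python
from collections import Counter


class CounterMachine:
    """A deterministic state machine: the same commands applied in the
    same order always produce the same outputs."""

    def __init__(self):
        self.value = 0

    def apply(self, command):
        op, arg = command
        if op == "add":
            self.value += arg
        elif op == "set":
            self.value = arg
        else:
            raise ValueError(f"unknown command: {op}")
        return self.value


class FaultyMachine(CounterMachine):
    """Hypothetical faulty replica that always reports a wrong output."""

    def apply(self, command):
        super().apply(command)
        return -1


def vote(outputs, t):
    """Return the output reported by at least t + 1 replicas.

    Assumes at most t of the replicas are faulty, so any output with
    t + 1 matching reports came from at least one correct replica.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < t + 1:
        raise RuntimeError("no output has t + 1 matching reports")
    return value


def run(replicas, commands, t):
    """Apply each command to every replica in the same order and vote."""
    results = []
    for command in commands:
        outputs = [replica.apply(command) for replica in replicas]
        results.append(vote(outputs, t))
    return results


# Illustrative run: t = 1, so 2t + 1 = 3 replicas, one of them faulty.
t = 1
replicas = [CounterMachine(), CounterMachine(), FaultyMachine()]
results = run(replicas, [("add", 5), ("set", 2), ("add", 3)], t)
print(results)  # [5, 2, 5]
```

In this run the faulty replica is outvoted on every command, so the client sees only the correct outputs. Under fail-stop failures a faulty replica halts rather than returning a wrong answer, so no vote is needed and t+1 replicas suffice.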