Practical hardening of crash-tolerant systems

Authors:
Miguel Correia; Daniel Gómez Ferro; Flavio P. Junqueira; Marco Serafini
Affiliations:
IST-UTL, INESC-ID, Lisbon, Portugal; Yahoo! Research, Barcelona, Spain; Yahoo! Research, Barcelona, Spain; Yahoo! Research, Barcelona, Spain
Venue:
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Year:
2012

Abstract

Recent failures of production systems have highlighted the importance of tolerating faults beyond crashes. The industry has so far addressed this problem by hardening crash-tolerant systems with ad hoc error detection checks, potentially overlooking critical fault scenarios. We propose a generic and principled hardening technique for Arbitrary State Corruption (ASC) faults, which specifically model the effects of realistic data corruptions on distributed processes. Hardening does not require the use of trusted components or the replication of the process over multiple physical servers. We implemented a wrapper library to transparently harden distributed processes. To exercise our library and evaluate our technique, we obtained ASC-tolerant versions of Paxos, of a subset of the ZooKeeper API, and of an eventually consistent storage system by implementing crash-tolerant protocols and automatically hardening them using our library. Our evaluation shows that the throughput of our ASC-hardened state machine replication outperforms its Byzantine-tolerant counterpart by up to 70%.
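
The abstract does not spell out how the wrapper library detects corruption, so the following is only a minimal sketch of one generic way to harden a single process without extra servers: keep two copies of the state, run each event handler on both, compare the results, and attach a checksum to outgoing messages. All names here (HardenedProcess, seal, unseal, counter_handler, CorruptionDetected) are hypothetical and are not the authors' API; the handler is assumed to be deterministic.

```python
import copy
import pickle
import zlib


class CorruptionDetected(Exception):
    """Raised when redundant copies or a message checksum disagree."""


def seal(payload):
    """Serialize a payload and pair it with a CRC32 of the bytes."""
    body = pickle.dumps(payload)
    return body, zlib.crc32(body)


def unseal(message):
    """Verify the CRC32 before deserializing; reject corrupted messages."""
    body, crc = message
    if zlib.crc32(body) != crc:
        raise CorruptionDetected("message checksum mismatch")
    return pickle.loads(body)


class HardenedProcess:
    """Hypothetical wrapper around a deterministic (state, event) handler."""

    def __init__(self, handler, initial_state):
        self.handler = handler
        # Two independent copies of the state (assumption: state is comparable).
        self.state = copy.deepcopy(initial_state)
        self.replica = copy.deepcopy(initial_state)

    def handle(self, event):
        # Check the copies before use: a corrupted copy shows up as a mismatch.
        if self.state != self.replica:
            raise CorruptionDetected("state copies diverged")
        # Run the handler once per copy and compare both results.
        new_a, out_a = self.handler(copy.deepcopy(self.state), event)
        new_b, out_b = self.handler(copy.deepcopy(self.replica), event)
        if new_a != new_b or out_a != out_b:
            raise CorruptionDetected("redundant executions disagree")
        self.state, self.replica = new_a, new_b
        return seal(out_a)


def counter_handler(state, event):
    """Toy crash-tolerant logic: add the event value to a counter."""
    state["count"] += event
    return state, state["count"]
```

In this sketch a detected mismatch raises an exception, so the process stops rather than acting on bad data, which turns a corruption into something the underlying crash-tolerant protocol can already handle. The doubled execution and the checksums are the price paid for not adding replicas or trusted components.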