State Machine Replication (SMR) is a fundamental technique for ensuring the dependability of critical services in modern internet-scale infrastructures. SMR alone does not protect against full crashes, so in practice it is combined with secondary storage to ensure the durability of the data these services manage. In this work we show that the classical durability-enforcing mechanisms (logging, checkpointing, and state transfer) can substantially degrade the performance of SMR-based services, even when SSDs are used instead of disks. To alleviate this impact, we propose three techniques that can be applied transparently, i.e., without modifying the SMR programming model or requiring extra resources: parallel logging, sequential checkpointing, and collaborative state transfer. We demonstrate the benefits of these techniques experimentally by implementing them in an open-source replication library and evaluating them in the context of a consistent key-value store and a coordination service.