Paxos replicated state machines as the basis of a high-performance data store

Authors:
William J. Bolosky;Dexter Bradshaw;Randolph B. Haagens;Norbert P. Kusters;Peng Li
Affiliations:
Microsoft Research;Microsoft;Microsoft;Microsoft;Microsoft
Venue:
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Year:
2011

Abstract

Conventional wisdom holds that Paxos is too expensive to use for high-volume, high-throughput, data-intensive applications. Consequently, fault-tolerant storage systems typically rely on special hardware, semantics weaker than sequential consistency, a limited update interface (such as append-only), primary-backup replication schemes that serialize all reads through the primary, clock synchronization for correctness, or some combination thereof. We demonstrate that a Paxos-based replicated state machine implementing a storage service can achieve performance close to the limits of the underlying hardware while tolerating arbitrary machine restarts, some permanent machine or disk failures and a limited set of Byzantine faults. We also compare it with two versions of primary-backup. The replicated state machine can serve as the data store for a file system or storage array. We present a novel algorithm for ensuring read consistency without logging, along with a sketch of a proof of its correctness.