Object storage on CRAQ: high-throughput chain replication for read-mostly workloads

Authors:
Jeff Terrace;Michael J. Freedman
Affiliations:
Princeton University;Princeton University
Venue:
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
Year:
2009



Abstract

Massive storage systems typically replicate and partition data over many potentially faulty components to provide both reliability and scalability. Yet many commercially deployed systems, especially those designed for interactive use by customers, sacrifice stronger consistency properties in the desire for greater availability and higher throughput. This paper describes the design, implementation, and evaluation of CRAQ, a distributed object-storage system that challenges this inflexible tradeoff. Our basic approach, an improvement on Chain Replication, maintains strong consistency while greatly improving read throughput. By distributing load across all object replicas, CRAQ scales linearly with chain size without increasing consistency coordination. At the same time, it exposes noncommitted operations for weaker consistency guarantees when this suffices for some applications, which is especially useful under periods of high system churn. This paper explores additional design and implementation considerations for geo-replicated CRAQ storage across multiple datacenters to provide locality-optimized operations. We also discuss multi-object atomic updates and multicast optimizations for large-object updates.
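
The abstract states that reads are spread across all replicas while strong consistency is preserved, but it does not spell out the mechanism. The following is a minimal single-process sketch, assuming the clean/dirty version scheme commonly associated with CRAQ: a replica answers from its local copy when its newest version is known to be committed, and otherwise asks the tail only for the committed version number. The names `Node`, `Chain`, `deliver_write`, `deliver_ack`, and `read` are hypothetical and chosen for illustration; they are not taken from the paper or its implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One replica in the chain (illustrative, not the paper's data layout)."""
    versions: dict = field(default_factory=dict)  # key -> {version: value}
    clean: dict = field(default_factory=dict)     # key -> newest committed version


class Chain:
    """Toy chain: nodes[0] is the head, nodes[-1] is the tail."""

    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def deliver_write(self, index, key, version, value):
        """Node `index` receives a new version travelling down the chain."""
        node = self.nodes[index]
        node.versions.setdefault(key, {})[version] = value  # dirty until acked
        if index == len(self.nodes) - 1:
            self._mark_clean(node, key, version)  # assumption: tail commits on arrival

    def deliver_ack(self, index, key, version):
        """Node `index` receives the commit acknowledgment travelling back up."""
        self._mark_clean(self.nodes[index], key, version)

    @staticmethod
    def _mark_clean(node, key, version):
        node.clean[key] = version
        # Versions older than the committed one are no longer needed.
        node.versions[key] = {
            v: val for v, val in node.versions[key].items() if v >= version
        }

    def write(self, key, version, value):
        """Convenience helper: run a full write, head to tail, then acks back."""
        for i in range(len(self.nodes)):
            self.deliver_write(i, key, version, value)
        for i in reversed(range(len(self.nodes) - 1)):
            self.deliver_ack(i, key, version)

    def read(self, index, key):
        """Strongly consistent read served by replica `index`."""
        node = self.nodes[index]
        stored = node.versions.get(key)
        if not stored:
            return None
        latest = max(stored)
        if node.clean.get(key) == latest:
            return stored[latest]                  # clean: answer locally
        committed = self.nodes[-1].clean.get(key)  # dirty: version query to the tail
        return stored.get(committed)
```

In this sketch only a version number crosses the network on a dirty read, never the object itself, which is one way to read the abstract's claim that read load scales with chain size "without increasing consistency coordination". Returning `stored[latest]` unconditionally would correspond to the weaker, noncommitted read mode the abstract mentions.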