ZooKeeper: wait-free coordination for internet-scale systems

Authors:
Patrick Hunt;Mahadev Konar;Flavio P. Junqueira;Benjamin Reed
Affiliations:
Yahoo! Grid;Yahoo! Grid;Yahoo! Research;Yahoo! Research
Venue:
USENIX ATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Year:
2010

Abstract

In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, it aims to provide a simple, high-performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper combines the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple yet powerful coordination service. The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per-client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high-performance processing pipeline with read requests being satisfied by local servers. We show that for the target workloads, with read-to-write ratios from 2:1 to 100:1, ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.
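
The event-driven mechanism that the abstract compares to cache invalidation can be illustrated with a small sketch. The code below is a toy, single-process model and not the ZooKeeper API: the names `ToyStore`, `CachingClient`, and `demo` are hypothetical, there is no replication or networking, and the one-shot behavior of the notifications is a simplifying assumption of this sketch. It shows only the idea that a client can serve repeated reads from a local cache and rely on a notification to learn that the cached value is stale.

```python
from collections import defaultdict


class ToyStore:
    """Toy in-memory stand-in for a coordination service (hypothetical, not ZooKeeper)."""

    def __init__(self):
        self._data = {}
        self._watches = defaultdict(list)  # path -> callbacks waiting for a change

    def write(self, path, value):
        self._data[path] = value
        # Remove the callbacks before firing them, so each one fires at most once.
        for callback in self._watches.pop(path, []):
            callback(path)

    def read(self, path, watch=None):
        # Reads never wait on other clients; a watch only registers a callback.
        if watch is not None:
            self._watches[path].append(watch)
        return self._data.get(path)


class CachingClient:
    """Client that caches reads and drops an entry when notified of a change."""

    def __init__(self, store):
        self._store = store
        self._cache = {}
        self.remote_reads = 0  # how many reads actually reached the store

    def get(self, path):
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._store.read(path, watch=self._invalidate)
        return self._cache[path]

    def _invalidate(self, path):
        # The notification carries no data; the next get() re-reads the value.
        self._cache.pop(path, None)


def demo():
    store = ToyStore()
    store.write("/config", "v1")
    client = CachingClient(store)
    first = client.get("/config")    # goes to the store and sets a watch
    second = client.get("/config")   # served from the local cache
    store.write("/config", "v2")     # fires the watch, invalidating the cache
    third = client.get("/config")    # goes to the store again
    return first, second, third, client.remote_reads


if __name__ == "__main__":
    print(demo())
```

In this sketch the notification only tells the client that something changed, and the client fetches the new value on its next read. That keeps the store's write path simple, at the cost of one extra read after each change.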