Untangling cluster management with Helix

Authors:
Kishore Gopalakrishna;Shi Lu;Zhen Zhang;Adam Silberstein;Kapil Surlaker;Ramesh Subramonian;Bob Schulman
Affiliations:
LinkedIn, Mountain View, CA;LinkedIn, Mountain View, CA;LinkedIn, Mountain View, CA;LinkedIn, Mountain View, CA;LinkedIn, Mountain View, CA;LinkedIn, Mountain View, CA;LinkedIn, Mountain View, CA
Venue:
Proceedings of the Third ACM Symposium on Cloud Computing
Year:
2012

Abstract

Distributed data systems are used in a variety of settings such as online serving, offline analytics, data transport, and search. They let organizations scale out their workloads using cost-effective commodity hardware, while retaining key properties like fault tolerance and scalability. At LinkedIn we have built a number of such systems. A key pattern we observe is that even though they may serve different purposes, they share much common functionality and tend to use common building blocks in their architectures. One such building block that is just beginning to receive attention is cluster management, which addresses the complexity of handling a dynamic, large-scale system with many servers. Such systems must handle software and hardware failures, setup tasks such as bootstrapping data, and operational issues such as data placement, load balancing, planned upgrades, and cluster expansion. All of this shared complexity, which we see in all of our systems, motivates us to build a cluster management framework, Helix, to solve these problems once in a general way. Helix provides an abstraction that lets a system developer separate coordination and management tasks from the functional tasks of each component of a distributed system. The developer defines the system behavior via a state model that enumerates the possible states of each component, the transitions between those states, and the constraints that govern the system's valid settings. Helix does the heavy lifting of ensuring the system satisfies that state model in the distributed setting, while also meeting the system's goals on load balancing and throttling state changes. We detail several Helix-managed production distributed systems at LinkedIn and how Helix has helped them avoid building custom management components. We describe the Helix design and implementation and present an experimental study that demonstrates its performance and functionality.
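To make the state-model idea concrete, the following is a minimal Python sketch of a declarative model with states, legal transitions, and a constraint on how many replicas of a partition may hold a given state. It is an illustration only and not the Helix API: the class `StateModel`, its fields, the state names `OFFLINE`, `STANDBY`, and `LEADER`, and the node names are all hypothetical choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StateModel:
    """Hypothetical state model: states, legal transitions, per-state caps."""
    states: set
    transitions: set      # set of (from_state, to_state) pairs
    max_replicas: dict    # state -> max replicas of one partition in that state

    def can_transition(self, src, dst):
        """True if moving a replica from src to dst is a declared transition."""
        return (src, dst) in self.transitions

    def violations(self, replica_states):
        """Return the states whose replica count exceeds its cap.

        replica_states maps node name -> current state for one partition.
        """
        counts = Counter(replica_states.values())
        return sorted(s for s, cap in self.max_replicas.items() if counts[s] > cap)


# Assumed example model: at most one LEADER replica per partition.
model = StateModel(
    states={"OFFLINE", "STANDBY", "LEADER"},
    transitions={
        ("OFFLINE", "STANDBY"),
        ("STANDBY", "LEADER"),
        ("LEADER", "STANDBY"),
        ("STANDBY", "OFFLINE"),
    },
    max_replicas={"LEADER": 1},
)

valid = {"node1": "LEADER", "node2": "STANDBY", "node3": "STANDBY"}
invalid = {"node1": "LEADER", "node2": "LEADER", "node3": "OFFLINE"}
```

In this sketch, `model.violations(invalid)` reports `LEADER` because two replicas hold that state, and `model.can_transition("OFFLINE", "LEADER")` is false because a replica must pass through `STANDBY` first. A cluster manager in the spirit described above would compute a sequence of legal transitions that moves the cluster to an assignment with no violations.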