Apache Hadoop goes realtime at Facebook

Authors:
Dhruba Borthakur;Jonathan Gray;Joydeep Sen Sarma;Kannan Muthukkaruppan;Nicolas Spiegelberg;Hairong Kuang;Karthik Ranganathan;Dmytro Molkov;Aravind Menon;Samuel Rash;Rodrigo Schmidt;Amitanand Aiyer
Affiliations:
Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA;Facebook, Palo Alto, CA
Venue:
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Year:
2011

Abstract

Facebook recently deployed Facebook Messages, its first-ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort, and discusses the application's requirements for consistency, availability, partition tolerance, data model, and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and at many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies that are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.