Data warehousing and analytics infrastructure at Facebook

Authors:
Ashish Thusoo;Zheng Shao;Suresh Anthony;Dhruba Borthakur;Namit Jain;Joydeep Sen Sarma;Raghotham Murthy;Hao Liu
Affiliations:
Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA;Facebook, Palo Alto, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

Scalable analysis of large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for Facebook Advertisers to more advanced ones such as friend recommendations. To support this diversity of use cases on an ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored, and contributed to a number of open source technologies to address these requirements at Facebook. These include Scribe, Hadoop, and Hive, which together form the cornerstones of the log collection, storage, and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.
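
The storage figures quoted in the abstract imply a compression ratio that the text does not state outright. The short Python sketch below works it out from the four quoted numbers. The function and variable names are illustrative and do not come from the paper, and the figures are treated as exact even though the abstract gives them as lower bounds ("more than").

```python
# Figures quoted in the abstract; treated as exact here (an assumption,
# since the abstract states them as "more than" lower bounds).
WAREHOUSE_RAW_PB = 15.0
WAREHOUSE_COMPRESSED_PB = 2.5
DAILY_RAW_TB = 60.0
DAILY_COMPRESSED_TB = 10.0


def compression_ratio(raw: float, compressed: float) -> float:
    """Return how many times smaller the compressed data is than the raw data."""
    return raw / compressed


# Hypothetical names, used only for this illustration.
warehouse_ratio = compression_ratio(WAREHOUSE_RAW_PB, WAREHOUSE_COMPRESSED_PB)
daily_ratio = compression_ratio(DAILY_RAW_TB, DAILY_COMPRESSED_TB)

print(f"warehouse compression ratio: {warehouse_ratio:.1f}x")
print(f"daily load compression ratio: {daily_ratio:.1f}x")
```

Both ratios come out the same, so the quoted totals and the quoted daily load are consistent with one another under this reading.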