Scalable analysis of large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for Facebook Advertisers to more advanced kinds such as friend recommendations. To support this diversity of use cases on an ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored, and contributed to a number of open source technologies to address these requirements at Facebook. These include Scribe, Hadoop, and Hive, which together form the cornerstones of the log collection, storage, and analytics infrastructure at Facebook. In this paper we present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and the future capabilities and improvements that we are working on.