On the self-similar nature of Ethernet traffic (extended version)
IEEE/ACM Transactions on Networking (TON)
Quickly generating billion-record synthetic databases
SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
The case for persistent-connection HTTP
SIGCOMM '95 Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
End-to-end Internet packet dynamics
SIGCOMM '97 Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication
A trace-driven analysis of the UNIX 4.2 BSD file system
Proceedings of the tenth ACM symposium on Operating systems principles
Introduction to Machine Learning (Adaptive Computation and Machine Learning)
Introduction to Machine Learning (Adaptive Computation and Machine Learning)
File Classification in Self-* Storage Systems
ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Detecting BGP configuration faults with static analysis
NSDI'05 Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation - Volume 2
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dynamic workload management for very large data warehouses: juggling feathers and bowling balls
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
An analysis of data corruption in the storage stack
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
A scalable, commodity data center network architecture
Proceedings of the ACM SIGCOMM 2008 conference on Data communication
The β-factor: measuring wireless link burstiness
Proceedings of the 6th ACM conference on Embedded network sensor systems
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
VL2: a scalable and flexible data center network
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Quincy: fair scheduling for distributed computing clusters
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
Towards characterizing cloud backend workloads: insights from Google compute clusters
ACM SIGMETRICS Performance Evaluation Review
ParaTimer: a progress indicator for MapReduce DAGs
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Proceedings of the ACM SIGCOMM 2010 conference
Energy management for MapReduce clusters
Proceedings of the VLDB Endowment
HaLoop: efficient iterative data processing on large clusters
Proceedings of the VLDB Endowment
Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Proceedings of the VLDB Endowment
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Sierra: practical power-proportionality for data center storage
Proceedings of the sixth conference on Computer systems
Scarlett: coping with skewed content popularity in mapreduce clusters
Proceedings of the sixth conference on Computer systems
Mesos: a platform for fine-grained resource sharing in the data center
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Automatic optimization for MapReduce programs
Proceedings of the VLDB Endowment
Apache hadoop goes realtime at Facebook
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Power management of online data-intensive services
Proceedings of the 38th annual international symposium on Computer architecture
Managing data transfers in computer clusters with orchestra
Proceedings of the ACM SIGCOMM 2011 conference
Design implications for enterprise storage systems via multi-dimensional trace analysis
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
The Case for Evaluating MapReduce Performance Using Workload Suites
MASCOTS '11 Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems
Energy efficiency for large-scale MapReduce workloads with significant interactive analysis
Proceedings of the 7th ACM european conference on Computer Systems
PACMan: coordinated memory caching for parallel jobs
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Parallel analytics as a service
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Workload management for big data analytics
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
The case for tiny tasks in compute clusters
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Data warehousing and OLAP over big data: current challenges and future research directions
Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Scale-up vs scale-out for Hadoop: time to rethink?
Proceedings of the 4th annual Symposium on Cloud Computing
Proceedings of the 17th International Database Engineering & Applications Symposium
Generating request streams on Big Data using clustered renewal processes
Performance Evaluation
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads
Proceedings of the VLDB Endowment
Cloud Based Big Data Analytics for Smart Future Cities
UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
MixApart: decoupled analytics for shared storage systems
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Analysis of HDFS under HBase: a facebook messages case study
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
SpringFS: bridging agility and performance in elastic distributed storage
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
Hi-index | 0.00 |
Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.