Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Authors:
Kai Ren;YongChul Kwon;Magdalena Balazinska;Bill Howe
Affiliations:
Carnegie Mellon University;Microsoft;University of Washington;University of Washington
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

We analyze Hadoop workloads from three different research clusters from a user-centric perspective. The goal is to better understand data scientists' use of the system and how well the use of the system matches its design. Our analysis suggests that Hadoop usage is still in its adolescence. We see underuse of Hadoop features, extensions, and tools. We see significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem. We also observe significant opportunities for optimizations of these workloads. We find that job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems. Overall, we present the first user-centered measurement study of Hadoop and find significant opportunities for improving its efficient use for data scientists.