Optimization strategies for A/B testing on HADOOP

Authors:
Andrii Cherniak;Huma Zaidi;Vladimir Zadorozhny
Affiliations:
University of Pittsburgh, Pittsburgh, PA;eBay Inc., San Jose, CA;University of Pittsburgh, Pittsburgh, PA
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

In this work, we present a set of techniques that considerably improve the performance of concurrent MapReduce jobs. Our proposed solution relies on proper resource allocation for concurrent Hive jobs based on data dependency, inter-query optimization, and modeling of Hadoop cluster load. To the best of our knowledge, this is the first work on Hive/MapReduce job optimization that takes Hadoop cluster load into consideration. We perform an experimental study that demonstrates a 233% reduction in execution time for the concurrent versus the sequential execution scheme. We report up to a 40% further reduction in execution time for concurrent job execution after resource usage optimization. The results reported in this paper were obtained in a pilot project to assess the feasibility of migrating A/B testing from a Teradata + SAS analytics infrastructure to Hadoop. This work was performed on an eBay production Hadoop cluster.