Benchmarking Approach for Designing a MapReduce Performance Model

  • Authors:
  • Zhuoyao Zhang; Ludmila Cherkasova; Boon Thau Loo

  • Affiliations:
  • University of Pennsylvania, Philadelphia, Pennsylvania, USA; Hewlett-Packard Labs, Palo Alto, California, USA; University of Pennsylvania, Philadelphia, Pennsylvania, USA

  • Venue:
  • Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
  • Year:
  • 2013

Abstract

In MapReduce environments, many programs are reused to process regularly arriving new data. A typical user question is how to estimate the completion time of these programs as a function of a new dataset and the cluster resources. In this work, we offer a novel performance evaluation framework for answering this question. We observe that the execution of each map (reduce) task consists of specific, well-defined data processing phases. Only the map and reduce functions are custom, and their executions are user-defined for different MapReduce jobs. The executions of the remaining phases are generic and depend on the amount of data processed by the phase and on the performance of the underlying Hadoop cluster. First, we design a set of parameterizable microbenchmarks to measure the generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using the job's past executions, we summarize the job's properties and the performance of its custom map/reduce functions in a compact job profile. Finally, by combining the job profile with the derived platform performance model, we obtain a MapReduce performance model that estimates the program's completion time for processing a new dataset. The evaluation study justifies our approach and the proposed framework: we are able to accurately predict the performance of a diverse set of twelve MapReduce applications. The predicted completion times for most experiments are within 10% of the measured ones (with a worst-case error of 17%) on our 66-node Hadoop cluster.
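To make the three-step procedure in the abstract concrete, the sketch below shows one plausible instantiation: linear per-phase models (duration = a + b · data) standing in for the platform performance model, a compact job profile capturing custom map/reduce costs and data selectivities, and a wave-based combination of the two to predict completion time. All names, the phase list, the coefficients, and the linear form itself are illustrative assumptions for exposition, not the paper's exact formulation.

```python
# Hypothetical sketch of the estimation pipeline described in the abstract.
# The linear phase model, phase list, and all coefficients are made-up
# assumptions; the paper derives its platform model from microbenchmarks.
import math
from dataclasses import dataclass

@dataclass
class PhaseModel:
    """Platform model for one generic phase: duration = a + b * data_mb."""
    a: float  # fixed startup cost (seconds)
    b: float  # per-MB processing cost (seconds/MB)

    def duration(self, data_mb: float) -> float:
        return self.a + self.b * data_mb

# Platform performance model: one model per generic Hadoop phase, fitted
# from microbenchmark measurements on the target cluster (values invented).
PLATFORM = {
    "read":    PhaseModel(0.5, 0.010),
    "collect": PhaseModel(0.2, 0.008),
    "spill":   PhaseModel(0.3, 0.012),
    "merge":   PhaseModel(0.4, 0.011),
    "shuffle": PhaseModel(1.0, 0.020),
    "write":   PhaseModel(0.5, 0.015),
}

@dataclass
class JobProfile:
    """Compact job profile summarized from the job's past executions."""
    map_cost_per_mb: float     # custom map function cost (seconds/MB)
    reduce_cost_per_mb: float  # custom reduce function cost (seconds/MB)
    map_selectivity: float     # map output size / map input size
    reduce_selectivity: float  # reduce output size / reduce input size

def estimate_completion_time(profile: JobProfile, input_mb: float,
                             n_maps: int, n_reduces: int,
                             map_slots: int, reduce_slots: int) -> float:
    """Predict job completion time (seconds) for a new dataset size."""
    # Per-task durations: generic phases come from the platform model,
    # custom function costs come from the job profile.
    map_in = input_mb / n_maps
    map_out = map_in * profile.map_selectivity
    map_task = (PLATFORM["read"].duration(map_in)
                + profile.map_cost_per_mb * map_in
                + PLATFORM["collect"].duration(map_out)
                + PLATFORM["spill"].duration(map_out)
                + PLATFORM["merge"].duration(map_out))

    red_in = input_mb * profile.map_selectivity / n_reduces
    red_out = red_in * profile.reduce_selectivity
    reduce_task = (PLATFORM["shuffle"].duration(red_in)
                   + profile.reduce_cost_per_mb * red_in
                   + PLATFORM["write"].duration(red_out))

    # Tasks execute in waves over the available map/reduce slots.
    return (math.ceil(n_maps / map_slots) * map_task
            + math.ceil(n_reduces / reduce_slots) * reduce_task)

# Example: a profiled job applied to a new 200 GB input.
profile = JobProfile(map_cost_per_mb=0.05, reduce_cost_per_mb=0.08,
                     map_selectivity=0.4, reduce_selectivity=0.1)
print(estimate_completion_time(profile, input_mb=200_000,
                               n_maps=400, n_reduces=64,
                               map_slots=132, reduce_slots=66))
```

The wave-based combination mirrors how Hadoop schedules tasks over a fixed number of slots; a faithful model would also handle effects this sketch omits, such as the overlap between the map stage and the shuffle of the first reduce wave.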