In this paper we address the problem of using MapReduce to sample a massive data set in order to produce a fixed-size sample whose contents satisfy a given predicate. While this computation is simple to express in MapReduce, its default Hadoop execution scans the entire input regardless of the required sample size, wasting cluster resources. This is unfortunate, as sampling queries are fairly common (e.g., for exploratory data analysis at Facebook), and the resulting waste can significantly degrade the performance of a shared cluster. To address such use cases, we present the design, implementation, and evaluation of a Hadoop execution-model extension that supports incremental job expansion. Under this model, a job consumes input only as required and can dynamically govern its resource consumption while producing the required results. The proposed mechanism supports a variety of policies that relate job growth rates to cluster capacity and current load. We have implemented the mechanism in Hadoop, and we present results from an experimental performance study of different job-growth policies under both single- and multi-user workloads.
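The incremental-expansion idea can be illustrated with a minimal single-process sketch: input splits are consumed in growing "waves" of map tasks, and the job stops as soon as the fixed-size predicate-satisfying sample is full. The function name, the wave-doubling growth policy, and the in-memory split representation below are illustrative assumptions, not the paper's actual Hadoop implementation.

```python
def predicate_sample_incremental(splits, predicate, k, growth=lambda w: 2 * w):
    """Collect up to k records satisfying `predicate`, consuming input
    splits in growing waves instead of scanning the whole input up front.

    `growth` stands in for a job-growth policy: given the current wave
    size, it returns the number of splits to schedule in the next wave.
    """
    sample = []
    wave, start = 1, 0
    while start < len(splits) and len(sample) < k:
        # Schedule only `wave` splits (map tasks) in this round.
        for split in splits[start:start + wave]:
            for record in split:
                if predicate(record):
                    sample.append(record)
                    if len(sample) == k:
                        # Sample is full: no further input is consumed.
                        return sample
        start += wave
        wave = growth(wave)  # the policy governs how fast the job expands
    return sample
```

With a selective predicate, later waves grow larger so that selective queries still finish quickly, while highly selective early matches let the job stop after touching only a fraction of the input; the paper's mechanism makes this trade-off tunable against cluster capacity and load.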