MRShare: sharing across multiple queries in MapReduce

Authors:
Tomasz Nykiel;Michalis Potamias;Chaitanya Mishra;George Kollios;Nick Koudas
Affiliations:
University of Toronto;Boston University;Facebook;Boston University;University of Toronto
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) that can be processed in batch mode. Because different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can in turn reduce the monetary charges incurred while using the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce. Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and provide a solution that derives the optimal grouping of queries. Experiments with our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach and substantial savings.
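
The idea of evaluating a group of jobs as a single query can be illustrated with a toy, in-memory sketch. This is not the MRShare implementation, its cost model, or its grouping optimizer; the function names (`run_job`, `run_merged`), the sample records, and the two example jobs are all hypothetical, and the sketch assumes every job in the group reads the same input. It only shows that a merged group scans the input once and tags each intermediate key with the job it belongs to.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run one toy MapReduce job: scan, map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def run_merged(records, jobs):
    """Run a group of jobs with a single scan of the shared input.

    `jobs` maps a job name to a (map_fn, reduce_fn) pair. Intermediate
    keys are tagged with the job name so results can be separated again.
    """
    groups = defaultdict(list)
    for record in records:  # the input is read once for the whole group
        for name, (map_fn, _) in jobs.items():
            for key, value in map_fn(record):
                groups[(name, key)].append(value)

    results = {name: {} for name in jobs}
    for (name, key), values in groups.items():
        reduce_fn = jobs[name][1]
        results[name][key] = reduce_fn(key, values)
    return results


# Hypothetical input and jobs, used only for illustration.
records = [("alice", 3), ("bob", 5), ("alice", 7)]
jobs = {
    "sum_by_user": (lambda r: [(r[0], r[1])], lambda k, vs: sum(vs)),
    "count_by_user": (lambda r: [(r[0], 1)], lambda k, vs: len(vs)),
}

merged = run_merged(records, jobs)
separate = {name: run_job(records, m, r) for name, (m, r) in jobs.items()}
```

In this sketch the merged run returns the same per-job results as running each job separately, while reading `records` once instead of once per job. Whether merging pays off in a real cluster depends on the extra intermediate data it produces, which is the trade-off the abstract's cost model is meant to capture.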