Efficient multi-way theta-join processing using MapReduce

Authors: Xiaofei Zhang; Lei Chen; Min Wang
Affiliations: HKUST, Hong Kong; HKUST, Hong Kong; HP Labs China, Beijing, China
Venue: Proceedings of the VLDB Endowment
Year: 2012

References (23)

A note on the strategy space of multiway join query optimization problem in parallel systems. ACM SIGMOD Record
Optimization of real conjunctive queries. PODS '93 Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
A threshold of ln n for approximating set cover. Journal of the ACM (JACM)
Optimizing Large Join Queries Using A Graph-Based Approach. IEEE Transactions on Knowledge and Data Engineering
Scheduling Malleable Parallel Tasks: An Asymptotic Fully Polynomial Time Approximation Scheme. Algorithmica
MapReduce: simplified data processing on large clusters. Communications of the ACM - 50th anniversary issue: 1958 - 2008
Scheduling shared scans of large data files. Proceedings of the VLDB Endowment
Introduction to Algorithms, Third Edition
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. Proceedings of the 1st ACM symposium on Cloud computing
G-Store: a scalable data store for transactional multi key access in the cloud. Proceedings of the 1st ACM symposium on Cloud computing
Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
MapReduce online. NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
The performance of MapReduce: an in-depth study. Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment
Processing theta-joins using MapReduce. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Apache Hadoop goes realtime at Facebook. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing. IEEE Transactions on Parallel and Distributed Systems
ES2: A cloud data storage system for supporting both OLTP and OLAP. ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
YSmart: Yet Another SQL-to-MapReduce Translator. ICDCS '11 Proceedings of the 2011 31st International Conference on Distributed Computing Systems
Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering
Query optimization for massively parallel data processing. Proceedings of the 2nd ACM Symposium on Cloud Computing

Cited by (2)

Minimal MapReduce algorithms. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Distributed data management using MapReduce. ACM Computing Surveys (CSUR)

Abstract

Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely used in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which has proven able to support OLAP applications over immense data volumes. In this work, we study the problem of efficiently processing multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although some prior work uses the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The main challenge is, given a number of processing units (each of which can run Map or Reduce tasks), to map a multi-way Theta-join query to a number of MapReduce jobs and execute them in a well-scheduled sequence such that the total processing time span is minimized. Our solution has two main parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves a significant improvement in join processing efficiency.
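
To make the term "chain-typed Theta-join" concrete, the following is a minimal single-machine sketch of the query semantics only. It is not the paper's MapReduce algorithm or its cost model. The function name `chain_theta_join`, the toy relations `R`, `S`, `T`, and the two join conditions are all hypothetical and chosen purely for illustration. The sketch assumes a chain means that each join condition relates two adjacent relations in the sequence, and that a Theta-join condition may be any comparison (here `<` and `>`), not just equality.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, ...]
Theta = Callable[[Row, Row], bool]


def chain_theta_join(
    relations: Sequence[Sequence[Row]],
    thetas: Sequence[Theta],
) -> List[Tuple[Row, ...]]:
    """Naive reference semantics for a chain Theta-join (illustrative only).

    thetas[i] is the condition between relations[i] and relations[i + 1].
    Every combination of rows is enumerated, so this is only suitable
    for tiny inputs; it shows what the query means, not how to run it fast.
    """
    if len(thetas) != len(relations) - 1:
        raise ValueError("need exactly one condition per adjacent pair")
    results = []
    for combo in product(*relations):
        # Keep the combination only if every adjacent pair satisfies
        # its own condition.
        if all(theta(combo[i], combo[i + 1]) for i, theta in enumerate(thetas)):
            results.append(combo)
    return results


# Hypothetical toy relations, one integer attribute each.
R = [(1,), (5,)]
S = [(3,), (7,)]
T = [(2,), (9,)]

# Chain query: R.a < S.b AND S.b > T.c
result = chain_theta_join(
    [R, S, T],
    [lambda r, s: r[0] < s[0], lambda s, t: s[0] > t[0]],
)
for rows in result:
    print(rows)
```

Because the conditions are inequalities, rows cannot simply be grouped by a shared key as in an equi-join, which is one reason such queries are harder to express in a (key, value) programming model.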