Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks

Authors:
Hyuck Han;Hyungsoo Jung;Hyeonsang Eom;Heon Y. Yeom
Affiliations:
School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-742;School of Information Technologies, University of Sydney, Sydney, Australia 2006;School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-742;School of Computer Science and Engineering, Seoul National University, Seoul, Korea 151-742
Venue:
Cluster Computing
Year:
2011

Abstract

Data-parallel frameworks are attractive for large-scale data processing because they let applications process huge amounts of data on commodity machines with little effort. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, where it has proven very practical. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. Our star-join algorithm for general data-parallel frameworks, called Scatter-Gather-Merge, processes star-join queries in a constant number of computation steps even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge eliminates a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
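
The abstract does not spell out the steps of Scatter-Gather-Merge, so the following is not the authors' algorithm. It is only a minimal single-process sketch of the general idea the abstract mentions: using Bloom filters built from dimension-table keys to discard fact-table rows before the actual join, which is where the I/O saving would come from. All names (`BloomFilter`, `star_join`, the column names) and the filter sizes are hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows with every dimension table.

    fact_rows:  list of dicts, one per fact-table row.
    dimensions: {foreign_key_column: {key: attribute_dict}}, assumed to
                contain only the dimension rows that passed the query's
                selection predicates (an assumption of this sketch).
    """
    # One Bloom filter per dimension, built from the surviving keys.
    filters = {}
    for fk, table in dimensions.items():
        bloom = BloomFilter()
        for key in table:
            bloom.add(key)
        filters[fk] = bloom

    # Cheap pruning pass: drop fact rows that some filter definitely rejects.
    candidates = [
        row for row in fact_rows
        if all(row[fk] in filters[fk] for fk in filters)
    ]

    # Exact join: also removes any Bloom-filter false positives.
    result = []
    for row in candidates:
        if all(row[fk] in dimensions[fk] for fk in dimensions):
            joined = dict(row)
            for fk, table in dimensions.items():
                joined.update(table[row[fk]])
            result.append(joined)
    return result
```

In a data-parallel setting, the pruning pass would presumably run where the fact-table partitions live, so that only candidate rows are shuffled across the network; the sketch above keeps everything in one process purely to show the filtering logic.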