Optimizing Multiway Joins in a Map-Reduce Environment

Authors:
Foto N. Afrati; Jeffrey D. Ullman
Affiliations:
National Technical University Athens, Athens; Stanford University, Stanford
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2011

Citing 0
Cited 9

Matrix chain multiplication via multi-way join algorithms in MapReduce
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication

Efficient multi-way theta-join processing using MapReduce
Proceedings of the VLDB Endowment

Cloud-based image processing system with priority-based data distribution mechanism
Computer Communications

Designing good algorithms for MapReduce and beyond
Proceedings of the Third ACM Symposium on Cloud Computing

Minimal MapReduce algorithms
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Upper and lower bounds on the cost of a map-reduce computation
Proceedings of the VLDB Endowment

The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)

A Large-scale Images Processing Model Based on Hadoop Platform
Proceedings of the Second International Conference on Innovative Computing and Cloud Computing

Exploiting inter-operation parallelism for matrix chain multiplication using MapReduce
The Journal of Supercomputing

Abstract

Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a “share,” which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. We give an algorithm for detecting and fixing cases where a variable is mistakenly included in the map-key. Then, we consider two important special cases: chain joins and star joins. In each case, we are able to determine the map-key and the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: 1) analytic queries in which a very large fact table is joined with smaller dimension tables, and 2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
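
To make the map-key and share idea concrete, here is a minimal sketch in Python. It assumes a three-way chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) with map-key {B, C}; the relation names, the functions `bucket`, `reducers_for`, and `optimal_shares`, and the use of Python's built-in `hash` on small non-negative integers are all illustrative choices, not taken from the paper's text. With shares b and c and k = b·c Reduce processes, each R tuple is missing C and is replicated to c reducers, each T tuple is missing B and is replicated to b reducers, and each S tuple goes to exactly one reducer. For this specific case, minimizing the replication cost r·c + t·b subject to b·c = k gives b = sqrt(k·r/t) and c = sqrt(k·t/r), where r and t are the sizes of R and T; real shares would have to be rounded to integers.

```python
import math


def bucket(value, share):
    """Hash a value into one of `share` buckets (illustrative hash choice)."""
    return hash(value) % share


def reducers_for(relation, tup, b, c):
    """Return the reducer ids (bucket of B, bucket of C) that must receive `tup`.

    Assumes the hypothetical schema R(A,B), S(B,C), T(C,D) and map-key {B, C}.
    """
    if relation == "R":
        # R has B but not C: replicate across all c buckets of C.
        _, vb = tup
        return [(bucket(vb, b), j) for j in range(c)]
    if relation == "S":
        # S has both map-key attributes: exactly one reducer.
        vb, vc = tup
        return [(bucket(vb, b), bucket(vc, c))]
    if relation == "T":
        # T has C but not B: replicate across all b buckets of B.
        vc, _ = tup
        return [(i, bucket(vc, c)) for i in range(b)]
    raise ValueError(f"unknown relation: {relation}")


def optimal_shares(r, t, k):
    """Real-valued shares minimizing r*c + t*b subject to b*c = k."""
    return math.sqrt(k * r / t), math.sqrt(k * t / r)


if __name__ == "__main__":
    b, c = 4, 3  # 12 Reduce processes in total
    print(reducers_for("R", (1, 5), b, c))
    print(reducers_for("S", (5, 7), b, c))
    print(reducers_for("T", (7, 9), b, c))
    print(optimal_shares(100, 100, 16))
```

In the usage above, the three tuples agree on B = 5 and C = 7, so the single reducer that receives the S tuple also appears in the reducer lists of the R and T tuples, which is what lets that reducer produce the joined result locally.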