Optimizing joins in a map-reduce environment

Authors:
Foto N. Afrati; Jeffrey D. Ullman
Affiliations:
National Technical University of Athens, Greece; Stanford University
Venue:
Proceedings of the 13th International Conference on Extending Database Technology
Year:
2010

Abstract

Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identifies the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, with the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. We give an algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine both the map-key and the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
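
The map-key and share mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a hypothetical three-way chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) with map-key {B, C}, integer shares b and c (so b*c Reduce processes), and a toy hash function. All function and variable names are made up for illustration. The helper optimal_shares uses the real-valued solution obtained by minimizing r*c + t*b subject to b*c = k with a Lagrange multiplier, which is an assumption about the cost model (tuples sent from Map to Reduce), not a claim about the authors' exact formulation.

```python
import math
from collections import defaultdict
from itertools import product


def optimal_shares(r, t, k):
    """Real-valued (b, c) minimizing r*c + t*b subject to b*c = k (assumed cost model)."""
    return math.sqrt(k * r / t), math.sqrt(k * t / r)


def bucket(value, shares):
    # Toy hash for illustration; a real system would use a stronger hash function.
    return hash(value) % shares


def map_phase(R, S, T, b, c):
    """Route tuples to reducers identified by (bucket of B, bucket of C)."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})
    sent = 0
    for a, bv in R:  # R lacks C, so it is replicated over all c buckets of C
        for j in range(c):
            reducers[(bucket(bv, b), j)]["R"].append((a, bv))
            sent += 1
    for bv, cv in S:  # S has both map-key attributes: exactly one reducer
        reducers[(bucket(bv, b), bucket(cv, c))]["S"].append((bv, cv))
        sent += 1
    for cv, d in T:  # T lacks B, so it is replicated over all b buckets of B
        for i in range(b):
            reducers[(i, bucket(cv, c))]["T"].append((cv, d))
            sent += 1
    return reducers, sent


def reduce_phase(reducers):
    """Each reducer joins only the tuples it received."""
    out = set()
    for parts in reducers.values():
        for (a, b1), (b2, c1), (c2, d) in product(parts["R"], parts["S"], parts["T"]):
            if b1 == b2 and c1 == c2:
                out.add((a, b1, c1, d))
    return out


def naive_join(R, S, T):
    """Single-machine reference join used to check the simulation."""
    return {(a, b1, c1, d)
            for (a, b1), (b2, c1), (c2, d) in product(R, S, T)
            if b1 == b2 and c1 == c2}


if __name__ == "__main__":
    R = [(a, a % 5) for a in range(20)]
    S = [(bv, cv) for bv in range(5) for cv in range(4)]
    T = [(cv, d) for cv in range(4) for d in range(3)]
    reducers, sent = map_phase(R, S, T, b=2, c=3)
    print("tuples sent to reducers:", sent)
    print("matches reference join:", reduce_phase(reducers) == naive_join(R, S, T))
```

Under these assumptions, a relation missing one map-key attribute is copied once per bucket of that attribute, so the total communication is |R|*c + |S| + |T|*b, and giving the larger share to the attribute missing from the smaller relation keeps the replication low.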