Toward intersection filter-based optimization for joins in MapReduce

Authors:
Thuong-Cang Phan;Laurent d'Orazio;Philippe Rigaux
Affiliations:
Blaise Pascal University, Clermont-Ferrand, France;Blaise Pascal University, Clermont-Ferrand, France;CEDRIC, CNAM, Paris, France
Venue:
Proceedings of the 2nd International Workshop on Cloud Intelligence
Year:
2013

Abstract

MapReduce has become an attractive and dominant model for processing large-scale datasets. However, the model was not designed to directly support operations with multiple inputs, such as joins. Many join algorithms for MapReduce have been studied, including the Bloom join, but they still generate too much non-joining data and transmit it over the network. This work addresses the problem with an intersection filter, based on probabilistic models, that removes most disjoint elements between two datasets. Specifically, we propose three ways to build the intersection Bloom filter. To apply the filter to joins, the corresponding MapReduce job is adjusted in a consistent way without increasing the related costs. We then consider two-way joins and join cascades and analyze their costs. Thanks to the high accuracy of the intersection filter, join processing can minimize disk I/O and communication costs. Finally, a cost-based comparison of joins using different approaches shows that the proposed approach is more effective than existing solutions.
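The idea of an intersection filter can be illustrated with a minimal sketch. It assumes one plausible construction: build a Bloom filter over the join keys of each input with the same size and hash functions, then combine the two with a bitwise AND. The abstract does not say which three constructions the authors propose, so this should not be read as their method. The names `BloomFilter`, `intersect`, and `filtered_join`, and the parameters `m` and `k`, are hypothetical, and the map and reduce phases are simulated in a single process.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, m=1024, k=3):
        self.m = m      # number of bits (assumed value, not from the paper)
        self.k = k      # number of hash functions (assumed value)
        self.bits = 0

    def _positions(self, key):
        # Derive k bit positions by salting a SHA-256 hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def __contains__(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

    def intersect(self, other):
        # Bitwise AND: every key present in both inputs still tests positive,
        # so the result has no false negatives for the key intersection.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share size and hash functions")
        out = BloomFilter(self.m, self.k)
        out.bits = self.bits & other.bits
        return out


def filtered_join(r, s, m=1024, k=3):
    """Equi-join two lists of (key, value) tuples after pre-filtering both."""
    bf_r, bf_s = BloomFilter(m, k), BloomFilter(m, k)
    for key, _ in r:
        bf_r.add(key)
    for key, _ in s:
        bf_s.add(key)
    ibf = bf_r.intersect(bf_s)

    # "Map" side: drop tuples whose key cannot be in the intersection.
    r_kept = [t for t in r if t[0] in ibf]
    s_kept = [t for t in s if t[0] in ibf]

    # "Reduce" side: exact hash join, which discards any false positives.
    groups = {}
    for key, val in r_kept:
        groups.setdefault(key, []).append(val)
    joined = [(key, rv, sv) for key, sv in s_kept for rv in groups.get(key, [])]
    return joined, len(r_kept), len(s_kept)
```

In this sketch both inputs are filtered before the simulated shuffle, whereas a filter built from only one input can prune only the other. Because Bloom filters admit false positives, a few non-joining tuples may still pass the filter; the exact join on the reduce side keeps the final result correct.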