Frequency-adaptive join for shared nothing machines
Progress in Computer Research
Practical Skew Handling in Parallel Joins
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Integrating Semi-Join-Reducers into State of the Art Query Processors
Proceedings of the 17th International Conference on Data Engineering
Novel parallel join algorithms for grid files
HIPC '96 Proceedings of the Third International Conference on High-Performance Computing (HiPC '96)
The Google file system
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI '04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Google's MapReduce programming model — Revisited
Science of Computer Programming
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
An optimal skew-insensitive join and multi-join algorithm for distributed architectures
DEXA'05 Proceedings of the 16th international conference on Database and Expert Systems Applications
Semi-join is the most widely used technique for optimizing the processing of complex relational queries on distributed architectures. However, the overhead of semi-join computation can be very high, owing to data skew and to the high cost of communication in distributed architectures. Internet search engines need to process vast amounts of raw data every day; systems that manage such data must therefore guarantee scalability, reliability, and availability while keeping query processing time reasonable. Hadoop and Google's File System are examples of such systems. In this paper, we present a new algorithm, based on the Map-Reduce-Merge model and distributed histograms, for processing semi-join operations on such systems. A cost analysis of this algorithm shows that our approach is insensitive to data skew while reducing communication and disk input/output costs to a minimum.
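The idea the abstract sketches — a semi-join driven by a distributed histogram of join keys, so that only tuples with matching keys are ever shuffled between nodes — can be illustrated in miniature. The following is a hedged, single-machine sketch, not the paper's algorithm: the relations `R` and `S`, their tuples, and the helper functions are hypothetical, and the real system's partitioning, merge phase, and skew-aware redistribution are elided.

```python
from collections import defaultdict

# Hypothetical relations R(key, payload) and S(key, payload).
R = [(1, "r1"), (2, "r2"), (2, "r2b"), (5, "r5")]
S = [(2, "s2"), (3, "s3"), (5, "s5")]

def build_histogram(relation):
    """First pass (map/reduce over S): count the frequency of each
    join key. In a distributed setting this histogram would be built
    in parallel and made available to every node."""
    hist = defaultdict(int)
    for key, _ in relation:
        hist[key] += 1
    return dict(hist)

def semi_join(relation, hist):
    """Second pass (map over R): emit only the tuples of R whose join
    key appears in the histogram of S. Tuples that cannot join are
    dropped locally, so they generate no communication cost."""
    return [(key, payload) for key, payload in relation if key in hist]

hist = build_histogram(S)
reduced = semi_join(R, hist)
print(reduced)  # only the R tuples whose keys occur in S survive
```

Because the histogram records key frequencies rather than just key membership, a system can also use it to detect heavily skewed keys and redistribute their work — which is what makes the histogram-based approach insensitive to data skew, in contrast to a plain Bloom-filter-style semi-join reduction.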