A comparison of join algorithms for log processing in MapReduce

Authors:
Spyros Blanas; Jignesh M. Patel; Vuk Ercegovac; Jun Rao; Eugene J. Shekita; Yuanyuan Tian
Affiliations:
University of Wisconsin-Madison, Madison, WI, USA; University of Wisconsin-Madison, Madison, WI, USA; IBM Almaden Research Center, San Jose, CA, USA; IBM Almaden Research Center, San Jose, CA, USA; IBM Almaden Research Center, San Jose, CA, USA; IBM Almaden Research Center, San Jose, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-stream or an event log is filtered, aggregated, or mined for patterns. As part of this analysis, the log often needs to be joined with reference data such as information about users. Although there have been many studies examining join algorithms in parallel and distributed DBMSs, the MapReduce framework is cumbersome for joins. MapReduce programmers often use simple but inefficient algorithms to perform joins. In this paper, we describe crucial implementation details of a number of well-known join strategies in MapReduce, and present a comprehensive experimental comparison of these join techniques on a 100-node Hadoop cluster. Our results provide insights that are unique to the MapReduce platform and offer guidance on when to use a particular join algorithm on this platform.
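To make the log-to-reference-data join concrete, the following is a minimal sketch of a reduce-side (repartition) join, simulated in plain Python rather than on Hadoop. It is not the implementation evaluated in the paper. The function names, the record layout, and the `user_id` join key are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical record layout: both inputs are lists of dicts sharing "user_id".

def map_log(record):
    # Map phase for the log: emit the join key and tag the record's origin.
    return record["user_id"], ("L", record)

def map_ref(record):
    # Map phase for the reference data (e.g. user information).
    return record["user_id"], ("R", record)

def shuffle(pairs):
    # Stand-in for the framework's shuffle: group all values by join key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(values):
    # Reduce phase: separate the two sides by tag, then emit their cross product.
    logs = [rec for tag, rec in values if tag == "L"]
    refs = [rec for tag, rec in values if tag == "R"]
    for log_rec in logs:
        for ref_rec in refs:
            yield {**log_rec, **ref_rec}

def repartition_join(log, reference):
    # Inner equi-join: keys present on only one side produce no output.
    pairs = [map_log(r) for r in log] + [map_ref(r) for r in reference]
    joined = []
    for values in shuffle(pairs).values():
        joined.extend(reduce_join(values))
    return joined
```

In this sketch every record from both inputs passes through the shuffle step, and each reduce call holds all records for one key in memory. Costs of this kind are one reason a simple join strategy can be inefficient on large logs.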