Memory footprint matters: efficient equi-join algorithms for main memory data processing

Authors:
Spyros Blanas; Jignesh M. Patel
Affiliations:
University of Wisconsin-Madison; University of Wisconsin-Madison
Venue:
Proceedings of the 4th annual Symposium on Cloud Computing
Year:
2013

Abstract

High-performance analytical data processing systems often run on servers with large amounts of main memory. A common operation in such environments is combining data from two or more sources using a join algorithm. This paper studies hash-based and sort-based equi-join algorithms when the data sets being joined reside entirely in main memory. We consider only a single-node setting, which is an important building block for larger high-performance distributed data processing systems. A critical contribution of this work is pointing out that, in addition to query response time, one must also consider the memory footprint of each join algorithm, because it affects the number of concurrent queries that can be serviced. Memory footprint becomes an important deployment consideration when analytical data processing services run on hardware that is shared with other concurrent services. We also consider the impact of particular physical properties of the input and output of each join algorithm; this information is essential for optimizing complex query pipelines with multiple joins. Our key contribution is characterizing the properties of hash-based and sort-based equi-join algorithms, which allows system implementers and query optimizers to make a more informed choice about which join algorithm to use.
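
To make the two algorithm families concrete, the following is a minimal Python sketch of a textbook hash join and a textbook sort-merge join over in-memory lists of `(key, payload)` tuples. It is not the paper's implementation; the function names, the tuple layout, and the sample data are assumptions made for illustration. The comments mark where each variant allocates extra memory and what order its output has, which are the two properties the abstract singles out.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Textbook hash join (illustrative, not the paper's algorithm)."""
    # The hash table is extra memory proportional to the build input.
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    # Output follows the probe input's order; it is not sorted by key.
    return [(key, b, p) for key, p in probe for b in table.get(key, ())]


def sort_merge_join(left, right):
    """Textbook sort-merge join (illustrative, not the paper's algorithm)."""
    # sorted() makes copies; sorting in place would avoid that extra memory
    # at the cost of reordering the caller's inputs.
    left = sorted(left, key=lambda t: t[0])
    right = sorted(right, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of right tuples that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left tuple with this key against that run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    # Output is ordered by join key, which a later operator may exploit.
    return out


# Hypothetical sample inputs.
R = [(2, "r2"), (1, "r1"), (2, "r2b")]
S = [(3, "s3"), (2, "s2"), (1, "s1"), (2, "s2b")]
hj = hash_join(R, S)
smj = sort_merge_join(R, S)
```

Both functions return the same set of matches for the same inputs. They differ in the auxiliary memory they allocate and in the order of the result, which is the kind of trade-off a query optimizer would weigh when choosing between them.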