PR-join: a non-blocking join achieving higher early result rate with statistical guarantees

Authors:
Shimin Chen;Phillip B. Gibbons;Suman Nath
Affiliations:
Intel Labs Pittsburgh, Pittsburgh, PA, USA;Intel Labs Pittsburgh, Pittsburgh, PA, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010


Abstract

Online aggregation is a promising solution to achieving fast early responses for interactive ad-hoc queries that compute aggregates on a large amount of data. Essential to the success of online aggregation is a good non-blocking join algorithm that enables both (i) high early result rates with statistical guarantees and (ii) fast end-to-end query times. We analyze existing non-blocking join algorithms and find that they all provide sub-optimal early result rates, and those with fast end-to-end times achieve them only by further sacrificing their early result rates. We propose a new non-blocking join algorithm, Partitioned expanding Ripple Join (PR-Join), which achieves considerably higher early result rates than previous non-blocking joins, while also delivering fast end-to-end query times. PR-Join performs separate, ripple-like join operations on individual hash partitions, where the width of a ripple expands multiplicatively over time. This contrasts with the non-partitioned, fixed-width ripples of Block Ripple Join. Assuming, as in previous non-blocking join studies, that the input relations are in random order, PR-Join ensures representative early results that are amenable to statistical guarantees. We show both analytically and with real-machine experiments that PR-Join achieves over an order of magnitude higher early result rates than previous non-blocking joins. We also discuss the benefits of using a flash-based SSD for temporary storage, showing that PR-Join can then achieve close to optimal end-to-end performance. Finally, we consider the joining of finite data streams that arrive over time, and find that PR-Join achieves similar or higher result rates than RPJ, the state-of-the-art algorithm specialized for that domain.
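
The two ideas named in the abstract, hash partitioning and a ripple whose width grows multiplicatively, can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the function names (`hash_partition`, `expanding_ripple_join`, `pr_join_sketch`), the parameters `initial_width` and `growth`, and the choice to process partitions one after another are all assumptions made for illustration, and the sketch ignores disk I/O, memory limits, and the statistical estimators.

```python
def hash_partition(rows, key, num_partitions):
    """Split rows into num_partitions lists by hash of the join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts


def expanding_ripple_join(r, s, key_r, key_s, initial_width=1, growth=2):
    """Ripple-style join of two lists, yielding matches as they are found.

    Each step reads up to `width` new rows from each input and joins them
    against everything read so far; `width` is then multiplied by `growth`.
    Both parameters are hypothetical knobs, not values from the paper.
    """
    seen_r = seen_s = 0
    width = initial_width
    while seen_r < len(r) or seen_s < len(s):
        new_r = r[seen_r:seen_r + width]
        new_s = s[seen_s:seen_s + width]
        # New R rows against previously seen S rows.
        for a in new_r:
            for b in s[:seen_s]:
                if key_r(a) == key_s(b):
                    yield (a, b)
        # New S rows against all R rows seen so far, including new_r.
        for b in new_s:
            for a in r[:seen_r + len(new_r)]:
                if key_r(a) == key_s(b):
                    yield (a, b)
        seen_r += len(new_r)
        seen_s += len(new_s)
        width *= growth  # the ripple widens multiplicatively


def pr_join_sketch(r, s, key_r, key_s, num_partitions=4):
    """Run an expanding ripple join separately on each hash partition pair."""
    r_parts = hash_partition(r, key_r, num_partitions)
    s_parts = hash_partition(s, key_s, num_partitions)
    for r_part, s_part in zip(r_parts, s_parts):
        yield from expanding_ripple_join(r_part, s_part, key_r, key_s)
```

Because matching keys always hash to the same partition, joining partition pairs independently produces the same result set as a join over the full inputs. Each matching pair is emitted exactly once, at the step in which the later of its two rows is read.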