A scalable hash ripple join algorithm

Authors:
Gang Luo;Curt J. Ellmann;Peter J. Haas;Jeffrey F. Naughton
Affiliations:
University of Wisconsin-Madison;NCR Advance Development Lab;IBM Almaden Research Center;University of Wisconsin-Madison
Venue:
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Year:
2002

References (11):

A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment
SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data

Query evaluation techniques for large databases
ACM Computing Surveys (CSUR)

Adaptive parallel aggregation algorithms
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Selectivity and cost estimation for joins based on random sampling
Journal of Computer and System Sciences

Online aggregation
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data

The art of computer programming, volume 3: (2nd ed.) sorting and searching

Ripple joins for online aggregation
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data

An adaptive query execution system for data integration
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data

Online Feedback for Nested Aggregate Queries with Multi-Threading
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases

Benchmarking Database Systems A Systematic Approach
VLDB '83 Proceedings of the 9th International Conference on Very Large Data Bases

Large-Sample and Deterministic Confidence Intervals for Online Aggregation
SSDBM '97 Proceedings of the Ninth International Conference on Scientific and Statistical Database Management

Cited by (24):

Consistent database sampling as a database prototyping approach
Journal of Software Maintenance: Research and Practice

Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results
ICDE '04 Proceedings of the 20th International Conference on Data Engineering

A disk-based join with probabilistic guarantees
Proceedings of the 2005 ACM SIGMOD international conference on Management of data

Early hash join: a configurable algorithm for the efficient and early production of join results
VLDB '05 Proceedings of the 31st international conference on Very large data bases

NSJ: an efficient non-blocking spatial join algorithm
GIS '06 Proceedings of the 14th annual ACM international symposium on Advances in geographic information systems

The Sort-Merge-Shrink join
ACM Transactions on Database Systems (TODS)

Online Random Shuffling of Large Database Tables
IEEE Transactions on Knowledge and Data Engineering

Scalable approximate query processing with the DBO engine
Proceedings of the 2007 ACM SIGMOD international conference on Management of data

The effect of reading policy on early join result production
Information Sciences: an International Journal

A transducer-based XML query processor
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

The history of histograms (abridged)
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Scalable approximate query processing with the DBO engine
ACM Transactions on Database Systems (TODS)

A Vision for Next Generation Query Processors and an Associated Research Agenda
Globe '09 Proceedings of the 2nd International Conference on Data Management in Grid and Peer-to-Peer Systems

Turbo-charging estimate convergence in DBO
Proceedings of the VLDB Endowment

A formal framework for database sampling
Information and Software Technology

PR-join: a non-blocking join achieving higher early result rate with statistical guarantees
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Continuous sampling for online aggregation over multiple queries
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

MapReduce online
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation

The case for object databases in cloud data management
ICOODB'10 Proceedings of the Third international conference on Objects and databases

Improving online aggregation performance for skewed data distribution
DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part I

Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases

You can stop early with COLA: online processing of aggregate queries in the cloud
Proceedings of the 21st ACM international conference on Information and knowledge management

Processing online aggregation on skewed data in mapreduce
Proceedings of the fifth international workshop on Cloud data management

Sampling estimators for parallel online aggregation
BNCOD'13 Proceedings of the 29th British National conference on Big Data

Abstract

Recently, Haas and Hellerstein proposed the hash ripple join algorithm in the context of online aggregation. Although the algorithm rapidly gives a good estimate for many join-aggregate problem instances, the convergence can be slow if the number of tuples that satisfy the join predicate is small or if there are many groups in the output. Furthermore, if memory overflows (for example, because the user allows the algorithm to run to completion for an exact answer), the algorithm degenerates to block ripple join and performance suffers. In this paper, we build on the work of Haas and Hellerstein and propose a new algorithm that (a) combines parallelism with sampling to speed convergence, and (b) maintains good performance in the presence of memory overflow. Results from a prototype implementation in a parallel DBMS show that its rate of convergence scales with the number of processors, and that when allowed to run to completion, even in the presence of memory overflow, it is competitive with the traditional parallel hybrid hash join algorithm.
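
To make the starting point concrete, the following is a minimal single-node, in-memory sketch of the basic hash ripple join idea that the abstract builds on. It is not the parallel algorithm proposed in the paper and it does not handle memory overflow. The function name `hash_ripple_join_sum`, the `(key, payload)` tuple layout, and the `value` callback are hypothetical names chosen for illustration. The sketch assumes an equijoin, reads both inputs in random order, and scales the running sum by |R||S| / (n_R n_S), where n_R and n_S are the numbers of tuples read so far, to estimate SUM over the full join.

```python
import random
from collections import defaultdict


def hash_ripple_join_sum(r, s, value, seed=0):
    """Yield (tuples_read, estimate) for SUM(value(a, b)) over an equijoin.

    r and s are lists of (join_key, payload) pairs. This is an illustrative
    in-memory sketch only: no parallelism and no memory-overflow handling.
    """
    rng = random.Random(seed)
    r, s = list(r), list(s)
    rng.shuffle(r)  # random read order makes the prefix a random sample
    rng.shuffle(s)

    inputs = (r, s)
    seen = (defaultdict(list), defaultdict(list))  # one hash table per input
    counts = [0, 0]                                # tuples read from r and s
    running_sum = 0.0

    for step in range(max(len(r), len(s))):
        for side in (0, 1):                        # alternate between inputs
            if step >= len(inputs[side]):
                continue                           # this input is exhausted
            key, payload = inputs[side][step]
            seen[side][key].append(payload)
            counts[side] += 1
            # Probe the other input's hash table for matches seen so far.
            for other in seen[1 - side].get(key, ()):
                a, b = (payload, other) if side == 0 else (other, payload)
                running_sum += value(a, b)
            if counts[0] and counts[1]:
                # Scale the sampled sum up to the full cross product.
                scale = (len(r) * len(s)) / (counts[0] * counts[1])
                yield counts[0] + counts[1], running_sum * scale


if __name__ == "__main__":
    r = [(i % 5, i) for i in range(50)]
    s = [(i % 5, 2 * i) for i in range(30)]
    for read, estimate in hash_ripple_join_sum(r, s, lambda a, b: a * b):
        if read % 20 == 0:
            print(read, estimate)
```

Because every matching pair is counted exactly once (when the later of the two tuples arrives), the estimate equals the exact answer once both inputs are fully read. With few matching tuples or many output groups, each estimate rests on very few sampled pairs, which is the slow-convergence case the abstract describes.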