New algorithms for join and grouping operations

Authors:
Goetz Graefe
Affiliations:
Hewlett-Packard Laboratories, Madison, USA
Venue:
Computer Science - Research and Development
Year:
2012

Abstract

Traditional database query processing relies on three types of algorithms for join and for grouping operations. For joins, index nested loops join exploits an index on its inner input, merge join exploits sorted inputs, and hash join exploits differences in the sizes of the join inputs. For grouping, an index-based algorithm has been used in the past whereas today sort- and hash-based algorithms prevail. Cost-based query optimization chooses the most appropriate algorithm for each query and for each operation. Unfortunately, mistaken algorithm choices during compile-time query optimization are common yet expensive to investigate and to resolve.

Our goal is to end mistaken choices among join algorithms and among grouping algorithms by replacing the three traditional types of algorithms with a single one. Like merge join, this new join algorithm exploits sorted inputs. Like hash join, it exploits different input sizes for unsorted inputs. In fact, for unsorted inputs, the cost functions for recursive hash join and for hybrid hash join have guided our search for the new join algorithm. In consequence, the new join algorithm can replace both merge join and hash join in a database management system.

The in-memory components of the new join algorithm employ indexes. If the database contains indexes for one (or both) of the inputs, the new join can exploit persistent indexes instead of temporary in-memory indexes. Using database indexes to find matching input records, the new join algorithm can also replace index nested loops join.

In addition to join operations, a very similar algorithm supports grouping ("group by" queries in SQL) and duplicate elimination. For unsorted inputs, candidate output records take on the role of one of the inputs in a join operation. Our goal is to define a single grouping algorithm that can replace grouping by repeated index searches, by sorting, and by hashing. In other words, our goal is to end mistaken algorithm choices not only for joins and other binary matching operations but also for grouping and other unary matching operations in database query processing.

Finally, these new algorithms can be instrumental for efficient and robust data processing in a map-reduce environment, because 'map' and 'reduce' operations are similar in essentials to join and grouping operations.

Results from an implementation of the core algorithm are reported.
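
For orientation, the three traditional join algorithms named in the abstract can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the new algorithm the paper proposes. The function names, the tuple layout (join key in the first field), and the use of a dictionary as a stand-in for a database index are all assumptions made for this example.

```python
from collections import defaultdict


def index_nested_loops_join(outer, inner_index):
    """Probe an existing index on the inner input once per outer row.

    inner_index is a hypothetical stand-in for a persistent index:
    a dict mapping join key -> list of inner rows.
    """
    return [(o, i) for o in outer for i in inner_index.get(o[0], [])]


def merge_join(left, right):
    """Join two inputs that are already sorted on the join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


def hash_join(small, large):
    """Build a hash table on the smaller input, then probe with the larger."""
    table = defaultdict(list)
    for s in small:
        table[s[0]].append(s)
    return [(s, l) for l in large for s in table.get(l[0], [])]
```

All three functions compute the same equi-join; they differ only in what they require of their inputs (an index, a sort order, or one input small enough to build a hash table on), which is the choice a cost-based optimizer has to make at compile time.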