SQL queries containing GROUP BY and aggregation occur frequently in decision support applications. Grouping with aggregation is typically done by first sorting the input and then performing the aggregation as part of the output phase of the sort. The most widely used external sorting algorithm is merge sort, consisting of a run formation phase followed by a (single) merge pass.

The amount of data output from the run formation phase can be reduced by a technique that we call early grouping. The idea is straightforward: simply form groups and perform aggregation during run formation. Each run then consists of partial groups instead of individual records. These partial groups are combined during the merge phase.

Early grouping always reduces the number of records output from the run formation phase. The relative output size depends on the amount of memory relative to the total number of groups and on the distribution of records over groups. When the input data is uniformly distributed (the worst case), our simulation results show that the relative output size is proportional to the relative amount of memory used. When the data is skewed (the more common case in practice), the relative output size is much smaller.
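The idea can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a stream of `(key, value)` records, a SUM aggregate, and a memory budget expressed as a maximum number of in-memory partial groups; all names are illustrative. Run formation aggregates into a hash table and, when the table fills, spills its contents as one sorted run of partial groups; the merge pass then combines partial groups with equal keys.

```python
import heapq
from itertools import groupby

def form_runs(records, max_groups_in_memory):
    """Run formation with early grouping: aggregate matching records into an
    in-memory table of partial groups; when the table is full and a new group
    arrives, spill the table as one sorted run of (key, partial_sum) pairs."""
    table = {}
    runs = []
    for key, value in records:
        if key in table:
            table[key] += value                   # aggregate in place, no new entry
        elif len(table) < max_groups_in_memory:
            table[key] = value                    # start a new partial group
        else:
            runs.append(sorted(table.items()))    # spill one sorted run
            table = {key: value}                  # begin a fresh table
    if table:
        runs.append(sorted(table.items()))        # final (possibly partial) run
    return runs

def merge_runs(runs):
    """Single merge pass: merge the sorted runs and combine partial groups
    that share a key into the final aggregated result."""
    merged = heapq.merge(*runs)                   # globally sorted stream of partials
    for key, partials in groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in partials)
```

For example, with records `[('a', 1), ('b', 2), ('a', 3), ('c', 4), ('b', 5)]` and room for two in-memory groups, run formation emits two runs totaling four partial groups rather than five individual records, and the merge pass combines the split group for `'b'`.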