SQL queries containing GROUP BY and aggregation occur frequently in decision support applications. Grouping with aggregation is typically done by first sorting the input and then performing the aggregation as part of the output phase of the sort. The most widely used external sorting algorithm is merge sort, consisting of a run formation phase followed by a (single) merge pass.

The amount of data output from the run formation phase can be reduced by a technique that we call early grouping. The idea is straightforward: simply form groups and perform aggregation during run formation. Each run then consists of partial groups instead of individual records. These partial groups are combined during the merge phase.

Early grouping always reduces the number of records output from the run formation phase. The relative output size depends on the amount of memory relative to the total number of groups and on the distribution of records over groups. When the input data is uniformly distributed (the worst case), our simulation results show that the relative output size is proportional to the relative amount of memory used. When the data is skewed (the more common case in practice), the relative output size is much smaller.
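The idea can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a stream of `(key, value)` records, a SUM aggregate, and a memory budget expressed as a maximum number of in-memory partial groups; all names are illustrative. Run formation aggregates into a hash table and, when the table fills, spills its contents as one sorted run of partial groups; the merge pass then combines partial groups with equal keys.

```python
import heapq
from itertools import groupby

def form_runs(records, max_groups_in_memory):
    """Run formation with early grouping: aggregate matching records into an
    in-memory table of partial groups; when the table is full and a new group
    arrives, spill the table as one sorted run of (key, partial_sum) pairs."""
    table = {}
    runs = []
    for key, value in records:
        if key in table:
            table[key] += value                   # aggregate in place, no new entry
        elif len(table) < max_groups_in_memory:
            table[key] = value                    # start a new partial group
        else:
            runs.append(sorted(table.items()))    # spill one sorted run
            table = {key: value}                  # begin a fresh table
    if table:
        runs.append(sorted(table.items()))        # final (possibly partial) run
    return runs

def merge_runs(runs):
    """Single merge pass: merge the sorted runs and combine partial groups
    that share a key into the final aggregated result."""
    merged = heapq.merge(*runs)                   # globally sorted stream of partials
    for key, partials in groupby(merged, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in partials)
```

For example, with records `[('a', 1), ('b', 2), ('a', 3), ('c', 4), ('b', 5)]` and room for two in-memory groups, run formation emits two runs totaling four partial groups rather than five individual records, and the merge pass combines the split group for `'b'`.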