Data reduction through early grouping

  • Authors:
  • W. Paul Yan; Paul Larson

  • Affiliations:
  • Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada; Department of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada

  • Venue:
  • CASCON '94 Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research
  • Year:
  • 1994

Abstract

SQL queries containing GROUP BY and aggregation occur frequently in decision support applications. Grouping with aggregation is typically done by first sorting the input and then performing the aggregation as part of the output phase of the sort. The most widely used external sorting algorithm is merge sort, consisting of a run formation phase followed by a (single) merge pass.

The amount of data output from the run formation phase can be reduced by a technique that we call early grouping. The idea is straightforward: simply form groups and perform aggregation during run formation. Each run will now consist of partial groups instead of individual records. These partial groups are then combined during the merge phase.

Early grouping always reduces the number of records output from the run formation phase. The relative output size depends on the amount of memory relative to the total number of groups and the distribution of records over groups. When the input data is uniformly distributed (the worst case), our simulation results show that the relative output size is proportional to the (relative) amount of memory used. When the data is skewed (the more common case in practice), the relative output size is much smaller.
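The idea of early grouping can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `form_runs` and `merge_runs`, the `max_groups` memory limit, the use of SUM as the aggregate, and the "flush the whole table when it is full" policy are all assumptions made for illustration, and in-memory lists stand in for runs written to disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def form_runs(records, max_groups):
    """Run formation with early grouping (illustrative sketch).

    Each run is a sorted list of partial groups (key, partial_sum)
    rather than a sorted list of individual records.
    max_groups is a hypothetical stand-in for the memory budget.
    """
    runs, table = [], {}
    for key, value in records:
        # A new key arriving when memory is full forces a spill.
        if key not in table and len(table) >= max_groups:
            runs.append(sorted(table.items()))
            table = {}
        # Aggregate in place: SUM is assumed as the aggregate function.
        table[key] = table.get(key, 0) + value
    if table:
        runs.append(sorted(table.items()))
    return runs


def merge_runs(runs):
    """Single merge pass that combines partial groups sharing a key."""
    merged = heapq.merge(*runs, key=itemgetter(0))
    return [(key, sum(partial for _, partial in group))
            for key, group in groupby(merged, key=itemgetter(0))]


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
runs = form_runs(records, max_groups=2)
totals = merge_runs(runs)
```

With room for only two groups in memory, the six input records in this example leave run formation as five partial groups spread over three runs, and the merge pass combines them into one total per key. A larger `max_groups`, or input in which a few keys dominate, would shrink the run output further.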