Extracting k most important groups from data efficiently

  • Authors:
  • Man Lung Yiu;Nikos Mamoulis;Vagelis Hristidis

  • Affiliations:
  • Department of Computer Science, Aalborg University, DK-9220 Aalborg, Denmark;Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong;School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is ''find the 10 combinations of product-type and month with the largest sum of sales''. The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), that exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.