Extracting k most important groups from data efficiently

Authors:
Man Lung Yiu; Nikos Mamoulis; Vagelis Hristidis
Affiliations:
Department of Computer Science, Aalborg University, DK-9220 Aalborg, Denmark; Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong; School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
Venue:
Data & Knowledge Engineering
Year:
2008

Abstract

We study an important data analysis operator that extracts the k most important groups from data, i.e., the k groups with the highest aggregate values. In a data warehousing context, an example of such a query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging because the number of potential groups can far exceed the memory capacity. We propose on-demand methods for efficient top-k groups processing under limited memory. In particular, we design top-k groups retrieval techniques for three representative scenarios. For data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), which exploits the available memory for efficient top-k groups computation. For unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and heuristics for deriving tight score bounds of hash partitions. Finally, for data clustered by a subset of the group-by attributes, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms.
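
To make the query itself concrete, the following is a minimal Python sketch of a naive baseline for a top-k groups query with SUM as the aggregate. It is not WMSA, RHA, or CGA: it assumes that every group fits in memory at once, which is exactly the assumption the abstract says does not hold in general. The function name `top_k_groups`, the column names, and the sample rows are hypothetical and chosen only for illustration.

```python
import heapq
from collections import defaultdict


def top_k_groups(rows, group_by, measure, k):
    """Return the k groups with the largest SUM(measure) as (key, total) pairs.

    Naive baseline (assumption): one in-memory counter per distinct group.
    """
    totals = defaultdict(float)
    for row in rows:
        # The group key is the combination of the group-by attribute values.
        key = tuple(row[attr] for attr in group_by)
        totals[key] += row[measure]
    # Select the k largest aggregate values without sorting every group.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


# Hypothetical sales records, invented for this example.
sales = [
    {"product_type": "laptop", "month": "Jan", "sales": 120.0},
    {"product_type": "laptop", "month": "Jan", "sales": 80.0},
    {"product_type": "phone", "month": "Jan", "sales": 150.0},
    {"product_type": "phone", "month": "Feb", "sales": 60.0},
    {"product_type": "tablet", "month": "Feb", "sales": 90.0},
]

result = top_k_groups(sales, ("product_type", "month"), "sales", k=2)
print(result)
```

The baseline keeps one running total per distinct group, so its memory use grows with the number of groups rather than with k. That growth is the limitation that motivates memory-bounded processing of the kind the abstract describes.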