Indexing for summary queries: Theory and practice

Authors:
Ke Yi; Lu Wang; Zhewei Wei
Affiliations:
Tsinghua University and Hong Kong University of Science and Technology; Hong Kong University of Science and Technology; MADALGO and Aarhus University
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2014

Abstract

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieve a collection of records from the database that match the query's conditions, while the latter return an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications, where users are interested not in the actual records but in statistics over them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries represent only two extremes for exploring the data. Data analysts often need more insight into the data distribution than simple aggregates provide, yet they do not want the sheer volume of data returned by reporting queries. In this article, we design indexing techniques that allow for extracting a statistical summary of all the records that match the query. The summaries we support include frequent items, quantiles, and various sketches, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with optimal or near-optimal query cost. We illustrate the efficiency and usefulness of our designs through extensive experiments and a system demonstration.
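
The idea of embedding precomputed summaries into an index can be illustrated with a small in-memory sketch. It is not the article's index structure and makes no claim to its space or query bounds. It assumes a static binary tree over records sorted by key, with a Misra-Gries frequent-items summary of at most k counters stored at each node. A range query merges the summaries of the O(log n) nodes that exactly cover the range. The names `mg_merge` and `SummaryIndex` are hypothetical and chosen only for this illustration.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def mg_merge(a, b, k):
    """Merge two Misra-Gries summaries, keeping at most k counters."""
    merged = Counter(a)
    merged.update(b)  # add the counts of b to those of a
    if len(merged) > k:
        # Subtract the (k+1)-th largest count from every counter
        # and keep only the counters that stay positive.
        cut = sorted(merged.values(), reverse=True)[k]
        merged = Counter({x: c - cut for x, c in merged.items() if c > cut})
    return merged


class SummaryIndex:
    """Static binary tree over records sorted by key (illustrative only).

    Each node stores a precomputed summary of the items in its subtree.
    """

    def __init__(self, records, k):
        # records: iterable of (key, item) pairs; k: counters per summary
        self.k = k
        self.records = sorted(records, key=lambda r: r[0])
        self.keys = [key for key, _ in self.records]
        self.n = len(self.records)
        self.tree = {}
        if self.n:
            self._build(1, 0, self.n)

    def _build(self, node, lo, hi):
        # Node covers the half-open position range [lo, hi).
        if hi - lo == 1:
            summary = Counter([self.records[lo][1]])
        else:
            mid = (lo + hi) // 2
            left = self._build(2 * node, lo, mid)
            right = self._build(2 * node + 1, mid, hi)
            summary = mg_merge(left, right, self.k)
        self.tree[node] = summary
        return summary

    def query(self, key_lo, key_hi):
        """Return a summary of the items whose key lies in [key_lo, key_hi]."""
        lo = bisect_left(self.keys, key_lo)
        hi = bisect_right(self.keys, key_hi)
        # Copy so callers cannot modify the stored node summaries.
        return Counter(self._query(1, 0, self.n, lo, hi))

    def _query(self, node, lo, hi, qlo, qhi):
        if qlo >= hi or qhi <= lo:
            return Counter()  # no overlap with the query range
        if qlo <= lo and hi <= qhi:
            return self.tree[node]  # node fully covered: reuse its summary
        mid = (lo + hi) // 2
        left = self._query(2 * node, lo, mid, qlo, qhi)
        right = self._query(2 * node + 1, mid, hi, qlo, qhi)
        return mg_merge(left, right, self.k)


# Usage: summarize the items of all records with key between 2 and 5.
index = SummaryIndex(
    [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a"), (6, "b")], k=4
)
print(index.query(2, 5).most_common())
```

Each counter in the returned summary never overestimates the true frequency of its item in the range, and it underestimates it by at most m/(k+1), where m is the number of records in the range. This sketch stores a summary at every node and therefore uses more than linear space, so it only illustrates the query mechanism.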