Tracking set-expression cardinalities over continuous update streams

Authors:
Sumit Ganguly;Minos Garofalakis;Rajeev Rastogi
Affiliations:
Bell Laboratories, Lucent Technologies, USA;Bell Laboratories, Lucent Technologies, USA;Bell Laboratories, Lucent Technologies, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2004

Citing 28
Cited 10

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences
Introduction to algorithms

Introduction to algorithms
Understanding the new SQL: a complete guide

Understanding the new SQL: a complete guide
The probabilistic communication complexity of set intersection

SIAM Journal on Discrete Mathematics
Randomized algorithms

Randomized algorithms
The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Communication complexity

Communication complexity
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Tracking join and self-join sizes in limited storage

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A small approximately min-wise independent family of hash functions

Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Selectively estimation for Boolean queries

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Towards estimation error guarantees for distinct values

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Space-efficient online computation of quantile summaries

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Estimating simple functions on the union of data streams

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Processing complex aggregate queries over data streams

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Dynamic multidimensional histograms

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries

Proceedings of the 27th International Conference on Very Large Data Bases
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes

Proceedings of the 27th International Conference on Very Large Data Bases
Sampling-Based Estimation of the Number of Distinct Values of an Attribute

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Finding Frequent Items in Data Streams

ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Counting Distinct Elements in a Data Stream

RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
An Approximate L1-Difference Algorithm for Massive Data Streams

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Stable distributions, pseudorandom generators, embeddings and data stream computation

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Processing set expressions over continuous update streams

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
How to summarize the universe: dynamic maintenance of quantiles

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Join-distinct aggregate estimation over update streams

Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
On synopses for distinct-value estimation under multiset operations

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
A simple and efficient estimation method for stream expression cardinalities

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Two improved range-efficient algorithms for F0 estimation

Theoretical Computer Science
Distinct-value synopses for multiset operations

Communications of the ACM - A View of Parallel Computing
Small synopses for group-by query verification on outsourced data streams

ACM Transactions on Database Systems (TODS)
Two improved range-efficient algorithms for F0 estimation

TAMC'07 Proceedings of the 4th international conference on Theory and applications of models of computation
Cardinality computing: a new step towards fully representing multi-sets by bloom filters

WISE'06 Proceedings of the 7th international conference on Web Information Systems
String similarity measures and joins with synonyms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Quantified Score

Hi-index	0.00

Visualization

Abstract

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (possibly distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask “what is the number of distinct IP source addresses seen in passing packets from both router R1 and R 2 but not router R3?”. Earlier work only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed ”2-level hash sketch”. We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only space that is significantly sublinear in the sizes of the streaming input (multi-)sets. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Finally, we propose an optimized, time-efficient stream synopsis (based on 2-level hash sketches) that provides similar, strong accuracy-space guarantees while requiring only guaranteed logarithmic maintenance time per update, thus making our methods applicable for truly rapid-rate data streams. Our results from an empirical study of our synopsis and estimation techniques verify the effectiveness of our approach.