Processing set expressions over continuous update streams

  • Authors:
  • Sumit Ganguly; Minos Garofalakis; Rajeev Rastogi

  • Affiliations:
  • Bell Laboratories, Lucent Technologies, Murray Hill, NJ (all authors)

  • Venue:
  • Proceedings of the 2003 ACM SIGMOD international conference on Management of data
  • Year:
  • 2003

Abstract

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions).

Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both routers R1 and R2 but not router R3?". Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union).

In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
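
The abstract names the 2-level hash sketch without detailing its construction. As a rough illustration of the kind of synopsis being described, a hash-based structure whose counters are incremented on insertions and decremented on deletions, so past stream items never need to be rescanned, here is a minimal Python sketch. The class name, the specific hash functions, and the singleton-detection rule below are assumptions made for the example; they are not the paper's exact construction or analysis.

```python
import random

class TwoLevelHashSketch:
    """Illustrative (not the paper's exact) hash-based synopsis for
    distinct-item estimation under insertions *and* deletions.

    First level: an item is mapped to a "level" via the number of
    trailing zeros of a hash value (Flajolet-Martin style).
    Second level: within each level, the item is hashed to one of a few
    buckets; each bucket keeps a signed total counter plus one signed
    counter per bit of the item identifier, so deletions simply
    decrement, and a bucket holding exactly one distinct item
    ("singleton") can be detected and its identifier recovered.
    """

    def __init__(self, domain_bits=32, levels=32, buckets=8, seed=0):
        rng = random.Random(seed)
        self.domain_bits = domain_bits
        self.levels = levels
        self.buckets = buckets
        # Random odd multipliers stand in for the hash functions
        # assumed by a formal analysis.
        self.h1 = rng.randrange(1, 1 << 61) | 1   # picks the level
        self.h2 = rng.randrange(1, 1 << 61) | 1   # picks the bucket
        # counts[level][bucket] = [total, bit_0, ..., bit_{domain_bits-1}]
        self.counts = [[[0] * (1 + domain_bits) for _ in range(buckets)]
                       for _ in range(levels)]

    def _level(self, item):
        h = (self.h1 * item) & ((1 << 61) - 1)
        tz = (h & -h).bit_length() - 1 if h else self.levels - 1
        return min(tz, self.levels - 1)

    def _bucket(self, item):
        return ((self.h2 * item) >> 13) % self.buckets

    def update(self, item, delta):
        """delta = +1 for an insertion of item, -1 for a deletion."""
        cell = self.counts[self._level(item)][self._bucket(item)]
        cell[0] += delta
        for b in range(self.domain_bits):
            if (item >> b) & 1:
                cell[1 + b] += delta

    def singleton(self, level, bucket):
        """Return the item if this bucket currently holds exactly one
        distinct item with positive net count; otherwise return None.
        (A simplified check; the paper uses a more careful test.)"""
        cell = self.counts[level][bucket]
        total = cell[0]
        if total <= 0:
            return None
        item = 0
        for b in range(self.domain_bits):
            c = cell[1 + b]
            if c == total:
                item |= 1 << b
            elif c != 0:
                return None   # mixed bucket: several distinct items
        return item
```

Under these assumptions, two such sketches built with the same hash functions over two different update streams can be inspected bucket-by-bucket (e.g., by comparing recovered singletons) to infer which distinct items appear in one stream, the other, or both; this is the kind of set-expression information the paper's estimators extract with provable space and error guarantees.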