Join-distinct aggregate estimation over update streams

Authors:
Sumit Ganguly;Minos Garofalakis;Amit Kumar;Rajeev Rastogi
Affiliations:
IIT Kanpur;Bell Laboratories;IIT Delhi;Bell Laboratories
Venue:
Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2005

Abstract

There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. Providing (perhaps approximate) answers to queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important requirement for several data-analysis scenarios. For instance, to enable real-time traffic analysis and load balancing, a network-monitoring application may need to estimate the number of distinct (source, destination) IP-address pairs occurring in the stream of IP packets observed by router R1, where the source address is also seen in packets routed through a different router R2. Earlier work has presented solutions to the individual problems of distinct counting and join-size estimation (without duplicate elimination) over streams. These solutions, however, are fundamentally different and extending or combining them to handle our more complex "Join-Distinct" estimation problem is far from obvious. In this paper, we propose the first space-efficient algorithmic solution to the general Join-Distinct estimation problem over continuous data streams (our techniques can actually handle general update streams comprising tuple deletions as well as insertions). Our estimators are probabilistic in nature and rely on novel algorithms for building and combining a new class of hash-based synopses (termed "JD sketches") for individual update streams. We demonstrate that our algorithms can provide low error, high-confidence Join-Distinct estimates using only small space and small processing time per update. In fact, we present lower bounds showing that the space usage of our estimators is within small factors of the best possible for the Join-Distinct problem. Preliminary experimental results verify the effectiveness of our approach.
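
To make the quantity in the router example concrete, the following is a minimal sketch of an exact, brute-force baseline in Python. It is not the JD-sketch estimator proposed in the paper: it stores every tuple, which is exactly the linear-space cost the paper's synopses are meant to avoid. The function name `join_distinct_exact`, the `(source, destination, delta)` update format, and the sample data are assumptions made for illustration only; `delta` is +1 for an insertion and -1 for a deletion.

```python
from collections import Counter


def join_distinct_exact(r1_updates, r2_updates):
    """Exact Join-Distinct count for the router example (illustrative baseline).

    Each update is a hypothetical (source, destination, delta) triple,
    where delta is +1 for an insertion and -1 for a deletion.
    Returns the number of distinct (source, destination) pairs with
    positive net frequency at R1 whose source also has positive net
    frequency at R2.
    """
    r1_pairs = Counter()    # net frequency of each (source, destination) at R1
    r2_sources = Counter()  # net frequency of each source address at R2

    for src, dst, delta in r1_updates:
        r1_pairs[(src, dst)] += delta
    for src, _dst, delta in r2_updates:
        r2_sources[src] += delta

    # Sources still present at R2 after all deletions are applied.
    live_sources = {s for s, count in r2_sources.items() if count > 0}

    return sum(
        1
        for (src, _dst), count in r1_pairs.items()
        if count > 0 and src in live_sources
    )


# Made-up update streams for the two routers.
r1 = [
    ("a", "x", +1), ("a", "x", +1),  # duplicate pair, counted once
    ("a", "y", +1), ("a", "y", -1),  # inserted, then deleted
    ("b", "x", +1),                  # source b never appears at R2
    ("c", "z", +1),                  # source c is deleted at R2
    ("d", "w", +1),
]
r2 = [("a", "q", +1), ("c", "q", +1), ("c", "q", -1), ("d", "q", +1)]

result = join_distinct_exact(r1, r2)
print(result)  # 2: the pairs (a, x) and (d, w)
```

The baseline needs memory proportional to the number of distinct tuples seen, so it serves only as a reference point against which a small-space approximate estimator could be checked.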