There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., one comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions).

Estimating the cardinality of set expressions defined over several (possibly distributed) update streams is one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?". Earlier work has addressed only very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
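To make the idea of a deletion-tolerant, hash-based synopsis more concrete, here is a minimal Python sketch of one way a two-level counter structure could be organized. The abstract does not describe the internals of the 2-level hash sketch, so everything below is an assumption made for illustration: the class name `TwoLevelHashSketch`, the parameters `num_buckets` and `num_second_level`, the use of SHA-256 as a stand-in hash, and the choice of a geometric first-level bucket plus 0/1 second-level counters. It only shows why such a structure never needs to revisit past items: every update adds a signed delta to a fixed set of counters, so a deletion exactly cancels the matching insertion.

```python
import hashlib


class TwoLevelHashSketch:
    """Illustrative (hypothetical) counter array indexed by
    (first-level bucket, second-level hash function, hash bit)."""

    def __init__(self, num_buckets=32, num_second_level=16, seed=0):
        self.num_buckets = num_buckets
        self.num_second_level = num_second_level
        self.seed = seed
        # counters[b][j] is a pair [count of bit 0, count of bit 1]
        self.counters = [[[0, 0] for _ in range(num_second_level)]
                         for _ in range(num_buckets)]

    def _hash(self, item, salt):
        # Assumption: SHA-256 stands in for a randomly chosen hash function.
        data = f"{self.seed}:{salt}:{item}".encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _bucket(self, item):
        # First level: index of the least-significant 1 bit, so bucket b
        # is chosen with probability of roughly 2^-(b+1).
        h = self._hash(item, "first")
        if h == 0:
            return self.num_buckets - 1
        lsb = (h & -h).bit_length() - 1
        return min(lsb, self.num_buckets - 1)

    def update(self, item, delta):
        """Apply an insertion (delta > 0) or a deletion (delta < 0)."""
        b = self._bucket(item)
        for j in range(self.num_second_level):
            bit = self._hash(item, j) & 1  # second-level 0/1 hash
            self.counters[b][j][bit] += delta

    def bucket_is_empty(self, b):
        return all(c0 == 0 and c1 == 0 for c0, c1 in self.counters[b])

    def bucket_is_singleton(self, b):
        # Heuristic test: with one distinct item in the bucket, each pair has
        # exactly one nonzero counter. Several distinct items can pass this
        # test only if they agree on every second-level bit.
        return all((c0 == 0) != (c1 == 0) for c0, c1 in self.counters[b])


# Usage: deletions cancel insertions without rescanning the stream.
sketch = TwoLevelHashSketch()
sketch.update("10.0.0.1", +1)
sketch.update("10.0.0.2", +1)
sketch.update("10.0.0.1", -1)
sketch.update("10.0.0.2", -1)
print(all(sketch.bucket_is_empty(b) for b in range(sketch.num_buckets)))
```

In a design along these lines, sketches built with the same hash functions over different streams could be compared bucket by bucket, which is the kind of property a set-expression estimator would need; how the paper's estimators actually combine the synopses is not stated in the abstract.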