Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences
Randomized algorithms
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Communication complexity
New sampling-based summary statistics for improving approximate query answers
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Random sampling for histogram construction: how much is enough?
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Towards estimation error guarantees for distinct values
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Estimating simple functions on the union of data streams
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Models and issues in data stream systems
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Maintaining stream statistics over sliding windows: (extended abstract)
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Querying and mining data streams: you only get one look a tutorial
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
New directions in traffic measurement and accounting
Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications
Data streams: algorithms and applications
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Estimating Rarity and Similarity over Data Stream Windows
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
What's hot and what's not: tracking most frequent items dynamically
Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Processing set expressions over continuous update streams
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Gigascope: a stream database for network applications
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Aurora: a data stream management system
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
TelegraphCQ: continuous dataflow processing
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Holistic UDAFs at streaming speeds
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Coresets in dynamic geometric data streams
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
Approximate frequency counts over data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A dip in the reservoir: maintaining sample synopses of evolving datasets
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Data streams: algorithms and applications
Foundations and Trends® in Theoretical Computer Science
Efficient stream sequence matching algorithms for handheld devices on time-series stream data
DBA'06 Proceedings of the 24th IASTED international conference on Database and applications
Random Sampling for Continuous Streams with Arbitrary Updates
IEEE Transactions on Knowledge and Data Engineering
Maintaining bernoulli samples over evolving multisets
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient Process of Top-k Range-Sum Queries over Multiple Streams with Minimized Global Error
IEEE Transactions on Knowledge and Data Engineering
A resource-minimalist flow size histogram estimator
Proceedings of the 8th ACM SIGCOMM conference on Internet measurement
Optimal sampling from sliding windows
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Small synopses for group-by query verification on outsourced data streams
ACM Transactions on Database Systems (TODS)
Fast Manhattan sketches in data streams
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Approximating sliding windows by cyclic tree-like histograms for efficient range queries
Data & Knowledge Engineering
1-pass relative-error Lp-sampling with applications
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms
Tight bounds for Lp samplers, finding duplicates in streams, and related problems
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Near-optimal private approximation protocols via a black box transformation
Proceedings of the forty-third annual ACM symposium on Theory of computing
Fast moment estimation in data streams in optimal space
Proceedings of the forty-third annual ACM symposium on Theory of computing
What's the difference?: efficient set reconciliation without prior context
Proceedings of the ACM SIGCOMM 2011 conference
Optimal sampling from sliding windows
Journal of Computer and System Sciences
Analyzing graph structure via linear measurements
Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms
Graph sketches: sparsification, spanners, and subgraphs
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Don't let the negatives bring you down: sampling from streams of signed updates
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Sketching for big data recommender systems using fast pseudo-random fingerprints
ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part II
Efficient sampling of non-strict turnstile data streams
FCT'13 Proceedings of the 19th international conference on Fundamentals of Computation Theory
Non-uniformity issues and workarounds in bounded-size sampling
The VLDB Journal — The International Journal on Very Large Data Bases
Hi-index | 0.00 |
Emerging data stream management systems approach the challenge of massive data distributions which arrive at high speeds while there is only small storage by summarizing and mining the distributions using samples or sketches. However, data distributions can be "viewed" in different ways. A data stream of integer values can be viewed either as the forward distribution f (x), ie., the number of occurrences of x in the stream, or as its inverse, f-1 (i), which is the number of items that appear i times. While both such "views" are equivalent in stored data systems, over data streams that entail approximations, they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet, many applications such as IP traffic monitoring naturally rely on mining inverse distributions.We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution vs the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream with provable guarantees which can be used for variety of summarization tasks (building quantiles or equidepth histograms) and mining (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on quality of approximations and time/space used by our streaming methods.We also complement our analytical and algorithmic results by presenting an experimental study of the methods over network data streams.