Fast range-summable random variables for efficient aggregate estimation

Authors:
Florin Rusu; Alin Dobra
Affiliations:
University of Florida; University of Florida
Venue:
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Year:
2006

Abstract

Exact computation of aggregate queries usually requires large amounts of memory (constrained in data streaming) or communication (constrained in distributed computation), as well as long processing times. In these settings, approximation techniques with provable guarantees, such as sketches, are the only viable solution. The performance of sketches depends crucially on the ability to generate particular pseudo-random numbers efficiently. In this paper we investigate, both theoretically and empirically, the problem of generating k-wise independent pseudo-random numbers and, in particular, that of generating 3-wise and 4-wise independent pseudo-random numbers that are fast range-summable (i.e., their sum over a range can be computed in time sub-linear in the size of the range). Our specific contributions are: (a) we provide an empirical comparison of the various pseudo-random number generating schemes; (b) we study, both theoretically and empirically, the practicality of fast range-summation for the 3-wise and 4-wise independent generating schemes, and we provide efficient implementations for the 3-wise independent schemes; (c) we show convincing theoretical and empirical evidence that the Extended Hamming scheme performs as well as any 4-wise independent scheme for estimating the size of joins using AMS sketches, even though it is only 3-wise independent. We use this generating scheme to produce estimators that significantly outperform the state-of-the-art solutions for two problems: the size of spatial joins and selectivity estimation.
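
To make the notion of fast range-summation concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a simple parity-based ±1 generator, xi(i) = (-1)^(s0 XOR parity(S AND i)), which is 3-wise independent when the seed (s0, S) is drawn uniformly at random. The function names `xi`, `dyadic_sum`, and `range_sum` are hypothetical, chosen for illustration only. The sketch relies on one property of this generator: over an aligned block of 2^j consecutive indices, the values either cancel exactly or are all equal, so an arbitrary interval can be summed from O(log n) aligned blocks instead of one term per index.

```python
import random


def xi(seed_bit, seed, i):
    """Return the +1/-1 value for index i: (-1)^(seed_bit XOR parity(seed AND i))."""
    parity = bin(seed & i).count("1") & 1
    return -1 if (seed_bit ^ parity) else 1


def dyadic_sum(seed_bit, seed, start, level):
    """Sum xi over [start, start + 2**level); start must be a multiple of 2**level."""
    low_mask = (1 << level) - 1
    if seed & low_mask:
        # Some low seed bit is set, so +1 and -1 values cancel exactly in the block.
        return 0
    # The low seed bits are all zero, so every value in the block equals xi(start).
    return (1 << level) * xi(seed_bit, seed, start)


def range_sum(seed_bit, seed, lo, hi):
    """Sum xi over the closed interval [lo, hi] using aligned power-of-two blocks."""
    total = 0
    end = hi + 1  # work with the half-open interval [lo, end)
    while lo < end:
        # Start from the largest block size that is aligned at lo ...
        level = (lo & -lo).bit_length() - 1 if lo else end.bit_length()
        # ... and shrink it until the block fits inside the remaining interval.
        while (1 << level) > end - lo:
            level -= 1
        total += dyadic_sum(seed_bit, seed, lo, level)
        lo += 1 << level
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    seed_bit, seed = rng.getrandbits(1), rng.getrandbits(16)
    lo, hi = 1234, 56789
    brute_force = sum(xi(seed_bit, seed, i) for i in range(lo, hi + 1))
    # The fast sum must agree with adding one term per index.
    assert range_sum(seed_bit, seed, lo, hi) == brute_force
```

The brute-force comparison at the end touches every index in the interval, while `range_sum` touches only a logarithmic number of blocks; that gap is what makes a generating scheme usable for sketching interval data such as the spatial joins mentioned in the abstract.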