The DLT priority sampling is essentially optimal

  • Authors: Mario Szegedy
  • Affiliations: Rutgers, The State University of NJ, Piscataway, NJ
  • Venue: Proceedings of the thirty-eighth annual ACM symposium on Theory of Computing
  • Year: 2006

Abstract

The priority sampling procedure of N. Duffield, C. Lund and M. Thorup is not only an exciting new approach to sampling weighted data streams, but it has also proven to be highly successful in a variety of practical applications. We resolve the two major issues related to its performance. First, we settle the main conjecture of N. Alon, N. Duffield, C. Lund and M. Thorup in [1], which states that the standard deviation of the subset sum estimator obtained from k priority samples is upper bounded by W/√(k-1), where W denotes the actual subset sum that the estimator estimates. Although Alon et al. give an O(W/√(k-1)) upper bound on the standard deviation of the estimator, their formula cannot be used as a performance guarantee in an applied setting, because the constants arising in their proof are very large. Our constant cannot be improved. We also resolve the conjecture of N. Duffield, C. Lund and M. Thorup which states that the variance of the priority sampling procedure is no larger than the variance of the threshold sampling procedure with sample size only one smaller. This is the main conjecture in [7]. The conjecture's significance is that the latter procedure is provably optimal within a very general class of sampling algorithms, but, unlike priority sampling, it is not practical. Our result therefore certifies that priority sampling achieves the rare feat of uniting mathematical elegance, (essential) optimality and applicability. Our proof is based on a new integral formula and on very finely tuned telescopic sums.
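For readers unfamiliar with the procedure being analyzed, the following Python sketch illustrates priority sampling and the subset sum estimator in the sense of Duffield, Lund and Thorup: each item i with weight w_i receives priority q_i = w_i/u_i for u_i uniform in (0, 1], the k highest-priority items are kept, τ is the (k+1)-st largest priority, and each kept item contributes max(w_i, τ) to the estimate. The function and variable names are illustrative choices, not taken from the paper.

```python
import random

def priority_sample(weights, k, rng=random):
    """Priority sampling (Duffield-Lund-Thorup sketch): keep the k items with
    the largest priorities q_i = w_i / u_i, where u_i is uniform in (0, 1].

    Returns a dict mapping sampled index -> weight estimate max(w_i, tau),
    where tau is the (k+1)-st largest priority (0 if k >= n).  Each per-item
    estimate is unbiased for the corresponding weight.
    """
    items = []
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()           # uniform in (0, 1], avoids division by zero
        items.append((w / u, i, w))      # (priority, index, weight)
    items.sort(reverse=True)
    tau = items[k][0] if k < len(items) else 0.0
    return {i: max(w, tau) for _, i, w in items[:k]}

def subset_sum_estimate(estimates, subset):
    """Estimate the total weight of `subset` from the per-item estimates."""
    return sum(est for i, est in estimates.items() if i in subset)

if __name__ == "__main__":
    weights = [5.0, 1.0, 0.5, 20.0, 2.0, 0.1, 7.0, 3.0]
    subset = {0, 3, 6}                   # true subset sum W = 32.0
    est = subset_sum_estimate(priority_sample(weights, k=4), subset)
    print(f"estimate: {est:.2f}  (true W = {sum(weights[i] for i in subset)})")
```

Under this scheme, the bound discussed in the abstract says that the standard deviation of such a subset sum estimate is at most W/√(k-1), where W is the true subset sum.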