On the variance of subset sum estimation

Authors:
Mario Szegedy;Mikkel Thorup
Affiliations:
Department of Computer Science, Rutgers, The State University of New Jersey;AT&T Labs-Research, Shannon Laboratory, Florham Park, NJ
Venue:
ESA'07 Proceedings of the 15th annual European conference on Algorithms
Year:
2007


Abstract

For high-volume data streams and large data warehouses, sampling is used to obtain efficient approximate answers to aggregate queries over selected subsets. We consider a set of weighted items whose weights may be heavy-tailed, and ask: which sampling scheme gives the most accurate subset sum estimates? We present a simple theorem on the variance of subset sum estimation and use it to prove the optimality and near-optimality of several known sampling schemes. The performance measure suggested in this paper is the average variance over all subsets of any given size. By optimal we mean that there is no set of input weights for which any sampling scheme can have a better average variance. For example, we show that appropriately weighted systematic sampling is simultaneously optimal for all subset sizes. More standard schemes, such as uniform sampling and probability-proportional-to-size sampling with replacement, can be arbitrarily bad. Knowing the variance optimality of different sampling schemes can help in deciding which scheme to apply in a given context.
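
The abstract does not spell out what "appropriately weighted" means, so the following Python sketch is only one plausible reading, not the authors' definition. It assumes strictly positive weights, threshold-style inclusion probabilities p_i = min(1, w_i / tau) with tau chosen so that the probabilities sum to the sample size k, and Horvitz-Thompson adjusted weights w_i / p_i. The function names are hypothetical.

```python
import random


def threshold_probabilities(weights, k):
    """Return p_i = min(1, w_i / tau), with tau chosen so that sum(p_i) == k.

    Assumption for this sketch: all weights are strictly positive.
    """
    n = len(weights)
    if k >= n:
        return [1.0] * n
    # Visit items from heaviest to lightest.
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    remaining = float(sum(weights))
    probs = [0.0] * n
    for t, i in enumerate(order):
        # Candidate threshold if exactly the t heaviest items get probability 1.
        tau = remaining / (k - t)
        if weights[i] <= tau:
            # Every remaining item is at or below the threshold.
            for j in order[t:]:
                probs[j] = weights[j] / tau
            break
        probs[i] = 1.0
        remaining -= weights[i]
    return probs


def systematic_sample(weights, k, rng=random):
    """Systematic sampling with the probabilities above.

    Returns {item index: adjusted weight w_i / p_i} for the sampled items.
    """
    probs = threshold_probabilities(weights, k)
    next_point = rng.random()  # sampling points are u, u + 1, u + 2, ...
    cumulative = 0.0
    sample = {}
    for i, (w, p) in enumerate(zip(weights, probs)):
        cumulative += p
        # Item i owns [cumulative - p, cumulative); since p <= 1, that
        # interval contains at most one sampling point.
        if cumulative > next_point:
            sample[i] = w / p
            next_point += 1.0
    return sample


def estimate_subset_sum(sample, subset):
    """Estimate the total weight of `subset` from the adjusted weights."""
    return sum(est for i, est in sample.items() if i in subset)
```

For example, `estimate_subset_sum(systematic_sample(weights, k), {0, 3, 5})` estimates the combined weight of items 0, 3 and 5. Under these assumptions each item is sampled with probability p_i, so the estimate is unbiased for every subset, and the estimate of the full set equals the true total up to floating-point rounding.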