Estimating arbitrary subset sums with few probes

Authors:
Noga Alon;Nick Duffield;Carsten Lund;Mikkel Thorup
Affiliations:
Tel Aviv University, Tel Aviv, Israel;AT&T Labs-Research, NJ;AT&T Labs-Research, NJ;AT&T Labs-Research, NJ
Venue:
Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2005

References (11):
Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS).
Online aggregation. SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data.
Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences.
New sampling-based summary statistics for improving approximate query answers. SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data.
On two-dimensional indexability and optimal range search indexing. PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
On random sampling over joins. SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data.
The Aqua approximate query answering system. SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data.
External memory data structures. Handbook of massive data sets.
On the relationship between file sizes, transport protocols, and self-similar network traffic. ICNP '96 Proceedings of the 1996 International Conference on Network Protocols (ICNP '96).
Flow sampling under hard resource constraints. Proceedings of the joint international conference on Measurement and modeling of computer systems.
Learn more, sample less: control of volume and variance in network measurement. IEEE Transactions on Information Theory.

Cited by (16):
The DLT priority sampling is essentially optimal. Proceedings of the thirty-eighth annual ACM symposium on Theory of computing.
Data streams: algorithms and applications. Foundations and Trends® in Theoretical Computer Science.
Bottom-k sketches: better and more efficient estimation of aggregates. Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
Summarizing data using bottom-k sketches. Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing.
ProgME: towards programmable network measurement. Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications.
Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM).
Confident estimation for multistage measurement sampling and aggregation. SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
Tighter estimation using bottom k sketches. Proceedings of the VLDB Endowment.
Optimal sampling from sliding windows. Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
Distribution fairness in Internet-scale networks. ACM Transactions on Internet Technology (TOIT).
Coordinated weighted sampling for estimating aggregates over multiple weight assignments. Proceedings of the VLDB Endowment.
On the variance of subset sum estimation. ESA'07 Proceedings of the 15th annual European conference on Algorithms.
ProgME: towards programmable network measurement. IEEE/ACM Transactions on Networking (TON).
Optimal sampling from sliding windows. Journal of Computer and System Sciences.
Streams, security and scalability. DBSec'05 Proceedings of the 19th annual IFIP WG 11.3 working conference on Data and Applications Security.
Estimating sum by weighted sampling. ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming.

Abstract

Suppose we have a large table T of items i, each with a weight w_i, e.g., people and their salaries. In a general preprocessing step for estimating arbitrary subset sums, we assign each item a random priority depending on its weight. Suppose we want to estimate the sum of an arbitrary subset I ⊆ T. For any q ≥ 2, considering only the q highest priority items from I, we obtain an unbiased estimator of the sum whose relative standard deviation is O(1/√q). Thus to get an expected approximation factor of 1 ± ε, it suffices to consider O(1/ε²) items from I. Our estimator needs no knowledge of the number of items in the subset I, but we can also estimate that number if we want to estimate averages.

The above scheme performs the same role as the on-line aggregation of Hellerstein et al. (SIGMOD'97), but it has the advantage of having expected good performance for any possible sequence of weights. In particular, the performance does not deteriorate in the common case of heavy-tailed weight distributions. This point is illustrated experimentally both with real and synthetic data.

We will also show that our approach can be used to improve Cohen's size estimation framework (FOCS'94).
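
The abstract does not spell out the priority function or the estimator, so the following is only a minimal sketch under stated assumptions. It assumes the standard priority sampling construction: item i gets priority w_i / u_i with u_i uniform in (0, 1], the q-th highest priority in I serves as a threshold τ, and each of the q − 1 items above it contributes max(w_i, τ). The function names `assign_priorities` and `estimate_subset_sum` are hypothetical, chosen for illustration.

```python
import random


def assign_priorities(weights, rng):
    """Preprocessing (assumed form): priority of item i is w_i / u_i, u_i uniform in (0, 1]."""
    # 1.0 - rng.random() lies in (0, 1], which avoids division by zero.
    return [w / (1.0 - rng.random()) for w in weights]


def estimate_subset_sum(weights, priorities, subset, q):
    """Estimate sum(weights[i] for i in subset) from its q highest-priority items."""
    if q < 2:
        raise ValueError("q must be at least 2")
    # For brevity the subset is sorted per query; this sketch does not model
    # the probing cost that a priority-ordered table would save.
    top = sorted(subset, key=lambda i: priorities[i], reverse=True)[:q]
    if len(top) < q:
        # Fewer than q items in the subset: all of them were seen, so the sum is exact.
        return float(sum(weights[i] for i in top))
    tau = priorities[top[-1]]  # the q-th highest priority acts as the threshold
    # Each of the q - 1 items above the threshold contributes max(w_i, tau).
    return float(sum(max(weights[i], tau) for i in top[:-1]))


if __name__ == "__main__":
    rng = random.Random(2005)
    # Synthetic heavy-tailed weights (illustrative data, not from the paper).
    weights = [rng.paretovariate(1.2) for _ in range(10_000)]
    priorities = assign_priorities(weights, rng)
    subset = [i for i in range(len(weights)) if i % 3 == 0]
    exact = sum(weights[i] for i in subset)
    print("exact:", exact)
    print("estimate with q=100:", estimate_subset_sum(weights, priorities, subset, 100))
```

In this sketch the estimate depends only on the weights and priorities of the q selected items, not on the size of I, which matches the abstract's remark that the estimator needs no knowledge of the number of items in the subset.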