Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and to meet resource constraints on bandwidth or battery power. Estimators applied to the sample facilitate fast approximate processing of queries posed over the original data, so the value of the sample hinges on the quality of these estimators. Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple instances: time periods, locations, or snapshots. We are interested in operations, such as quantiles and range, that span multiple instances. Subset-sums of these operations are used in applications ranging from planning to anomaly and change detection. Unbiased low-variance estimators are particularly effective, as the relative error decreases with aggregation. The Horvitz-Thompson estimator, known to minimize variance for subset-sums over a sample of a single instance, is not optimal for multi-instance operations because it fails to exploit samples that provide partial information on the estimated quantity. We present a general, principled methodology for deriving optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in estimation accuracy for fundamental queries under common sampling schemes.
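To make the single-instance baseline concrete, the sketch below illustrates the Horvitz-Thompson subset-sum estimator the abstract refers to: each sampled item contributes its weight divided by its known inclusion probability, so the estimate is unbiased over the randomness of the sampling. The data, the threshold `tau`, and the probability-proportional-to-size (Poisson) sampling scheme are illustrative assumptions, not taken from the paper.

```python
import random

def ht_subset_sum(sample, subset):
    """Horvitz-Thompson estimate of sum(w_i for i in subset).

    `sample` holds (key, weight, inclusion_probability) triples for the
    sampled items only; unsampled items implicitly contribute 0, and each
    sampled item in the subset contributes w_i / p_i, which makes the
    estimator unbiased."""
    return sum(w / p for (key, w, p) in sample if key in subset)

# Toy data: keys with weights. We use independent (Poisson) sampling with
# inclusion probability proportional to weight, capped at 1.
weights = {f"k{i}": w for i, w in enumerate([5, 1, 8, 2, 4, 10, 3, 7])}
tau = 12.0  # hypothetical sampling threshold: p_i = min(1, w_i / tau)

random.seed(0)
sample = [(k, w, min(1.0, w / tau))
          for k, w in weights.items()
          if random.random() < min(1.0, w / tau)]

subset = {"k0", "k2", "k5"}  # query: total weight of these keys
true_sum = sum(weights[k] for k in subset)
estimate = ht_subset_sum(sample, subset)
```

Averaged over many sampling runs, `estimate` converges to `true_sum`; any single run can over- or under-shoot. The multi-instance setting studied in the paper improves on this per-instance estimator by also using samples that carry only partial information about the queried quantity.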