Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and to meet resource constraints on bandwidth or battery power. Estimators applied to the sample facilitate fast approximate processing of queries posed over the original data, so the value of the sample hinges on the quality of these estimators. Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple instances: time periods, locations, or snapshots. We are interested in operations, such as quantiles and range, that span multiple instances. Subset-sums of these operations are used in applications ranging from planning to anomaly and change detection. Unbiased low-variance estimators are particularly effective, as the relative error decreases with aggregation. The Horvitz-Thompson estimator, known to minimize variance for subset-sums over a sample of a single instance, is not optimal for multi-instance operations because it fails to exploit samples that provide partial information on the estimated quantity. We present a general, principled methodology for deriving optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in estimation accuracy for fundamental queries under common sampling schemes.
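To make the single-instance baseline concrete, the sketch below illustrates the Horvitz-Thompson subset-sum estimator the abstract refers to: each sampled item contributes its weight divided by its known inclusion probability, so the estimate is unbiased over the randomness of the sampling. The data, the threshold `tau`, and the probability-proportional-to-size (Poisson) sampling scheme are illustrative assumptions, not taken from the paper.

```python
import random

def ht_subset_sum(sample, subset):
    """Horvitz-Thompson estimate of sum(w_i for i in subset).

    `sample` holds (key, weight, inclusion_probability) triples for the
    sampled items only; unsampled items implicitly contribute 0, and each
    sampled item in the subset contributes w_i / p_i, which makes the
    estimator unbiased."""
    return sum(w / p for (key, w, p) in sample if key in subset)

# Toy data: keys with weights. We use independent (Poisson) sampling with
# inclusion probability proportional to weight, capped at 1.
weights = {f"k{i}": w for i, w in enumerate([5, 1, 8, 2, 4, 10, 3, 7])}
tau = 12.0  # hypothetical sampling threshold: p_i = min(1, w_i / tau)

random.seed(0)
sample = [(k, w, min(1.0, w / tau))
          for k, w in weights.items()
          if random.random() < min(1.0, w / tau)]

subset = {"k0", "k2", "k5"}  # query: total weight of these keys
true_sum = sum(weights[k] for k in subset)
estimate = ht_subset_sum(sample, subset)
```

Averaged over many sampling runs, `estimate` converges to `true_sum`; any single run can over- or under-shoot. The multi-instance setting studied in the paper improves on this per-instance estimator by also using samples that carry only partial information about the queried quantity.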