Coordinated weighted sampling for estimating aggregates over multiple weight assignments

Authors:
Edith Cohen;Haim Kaplan;Subhabrata Sen
Affiliations:
AT&T Labs--Research, Florham Park, NJ;Tel Aviv University, Tel Aviv, Israel;AT&T Labs--Research, Florham Park, NJ
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to a single set of weights, such as weighted sums, and in aggregates over multiple sets of weights, such as the L1 difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary supports approximate processing of queries that may be specified after the summary is generated. Current designs, however, are geared toward data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments, and we obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets, ranging from IP network data to stock quotes.
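
To make the idea of coordinated weighted samples concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the estimators developed in the paper: it assumes bottom-k samples with priority-style ranks u(key) / w(key), where the seed u(key) comes from a hash of the key and is therefore shared by every weight assignment, and it uses the standard priority sampling adjusted weight max(w, 1/tau) to estimate a weighted sum over a single assignment. The names `shared_seed`, `bottom_k_sample`, and `estimate_sum`, the `salt` parameter, and the example weights are all hypothetical.

```python
import hashlib
import heapq


def shared_seed(key, salt="demo"):
    """Hash a key to a reproducible pseudo-uniform value strictly inside (0, 1).

    Assumption: because the seed depends only on the key (and the salt), every
    weight assignment sees the same seed, which is what coordinates the samples.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    n = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (n + 0.5) / 2**52


def bottom_k_sample(weights, k, salt="demo"):
    """Return (sample, tau) for one weight assignment.

    sample maps the k keys of smallest rank u(key) / w(key) to their weights;
    tau is the (k+1)-th smallest rank, or infinity if there is no such key.
    """
    ranked = [(shared_seed(key, salt) / w, key)
              for key, w in weights.items() if w > 0]
    smallest = heapq.nsmallest(k + 1, ranked)
    tau = smallest[k][0] if len(smallest) > k else float("inf")
    sample = {key: weights[key] for _, key in smallest[:k]}
    return sample, tau


def estimate_sum(sample, tau, predicate=lambda key: True):
    """Estimate the total weight of keys satisfying predicate.

    Each sampled key contributes the adjusted weight max(w, 1 / tau);
    keys outside the sample contribute zero.
    """
    return sum(max(w, 1.0 / tau) for key, w in sample.items() if predicate(key))


if __name__ == "__main__":
    # Two hypothetical weight assignments over the same keys (e.g. two time periods).
    period_1 = {f"flow{i}": float(1 + (i * 7) % 13) for i in range(200)}
    period_2 = {key: w * (1.5 if i % 10 == 0 else 1.0)
                for i, (key, w) in enumerate(period_1.items())}

    sample_1, tau_1 = bottom_k_sample(period_1, k=20)
    sample_2, tau_2 = bottom_k_sample(period_2, k=20)

    print("exact sum, period 1:    ", sum(period_1.values()))
    print("estimated sum, period 1:", estimate_sum(sample_1, tau_1))
    print("keys shared by both samples:", len(sample_1.keys() & sample_2.keys()))
```

Because the two assignments reuse the same seed per key, keys whose weights are similar in both periods tend to be sampled in both, which is the property that aggregates spanning several assignments rely on. Independent samples of the same size would typically share far fewer keys.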