Leveraging discarded samples for tighter estimation of multiple-set aggregates

Authors:
Edith Cohen;Haim Kaplan
Affiliations:
AT&T Labs-Research, Florham Park, NJ, USA;Tel Aviv University, Tel Aviv, Israel
Venue:
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Year:
2009

Abstract

Many datasets, including market basket data, text or hypertext documents, and events recorded in different locations or time periods, can be modeled as a collection of sets over a ground set of keys. Common queries over such data, including similarity or association rules, are represented as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and memberships in particular sets. On massive data sets, exact computation of such aggregates can be inefficient or infeasible, and therefore approximate queries are processed over sketches of the sets. Sketches based on coordinated random samples are scalable, flexible, and well suited for many applications. Queries are resolved by producing a sketch of the union of sets used in the predicate from the sketches of these sets and then applying an estimator to this union-sketch. We derive novel, tighter (unbiased) estimators that leverage sampled keys that are present in the union of applicable sketches but excluded from the union-sketch. We establish analytically that our estimators dominate estimators applied to the union-sketch for all queries and data sets. Empirical evaluation on synthetic and real data reveals that on typical applications we can expect a 25% to 4-fold reduction in estimation error.
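
The setting described in the abstract can be illustrated with a minimal Python sketch. It assumes unweighted keys and bottom-k sketches coordinated through a shared hash, which is one common form of coordinated sampling; the function names (rank, bottom_k, union_sketch, discarded, jaccard_estimate) are hypothetical and not taken from the paper. It shows only the baseline union-sketch estimator and identifies the discarded samples; it does not implement the paper's tighter estimators.

```python
import hashlib


def rank(key):
    """Shared pseudo-random rank in [0, 1).

    Coordination means the same key gets the same rank in every set.
    (Assumption: a SHA-256 based hash stands in for a random rank assignment.)
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_k(keys, k):
    """Bottom-k sketch of a set: its k keys of smallest rank."""
    return sorted(set(keys), key=rank)[:k]


def union_sketch(sketch_a, sketch_b, k):
    """Bottom-k sketch of A | B, computed from the two sketches alone."""
    return sorted(set(sketch_a) | set(sketch_b), key=rank)[:k]


def discarded(sketch_a, sketch_b, k):
    """Sampled keys present in some sketch but excluded from the union sketch."""
    return (set(sketch_a) | set(sketch_b)) - set(union_sketch(sketch_a, sketch_b, k))


def jaccard_estimate(sketch_a, sketch_b, k):
    """Baseline estimate of |A & B| / |A | B| using only the union sketch."""
    union = union_sketch(sketch_a, sketch_b, k)
    if not union:
        return 0.0
    in_a, in_b = set(sketch_a), set(sketch_b)
    # A key in the union sketch belongs to A exactly when it is in A's sketch.
    both = sum(1 for key in union if key in in_a and key in in_b)
    return both / len(union)


if __name__ == "__main__":
    A = range(0, 1000)
    B = range(500, 1500)
    k = 200
    sa, sb = bottom_k(A, k), bottom_k(B, k)
    print("estimated Jaccard:", jaccard_estimate(sa, sb, k))
    print("true Jaccard:     ", 500 / 1500)
    print("discarded samples:", len(discarded(sa, sb, k)))
```

The keys returned by discarded are the samples that the baseline estimator ignores; the abstract's contribution is to use them as well, which this sketch leaves out.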