Many data sets, including market basket data, text or hypertext documents, and events recorded in different locations or time periods, can be modeled as a collection of sets over a ground set of keys. Common queries over such data, including similarity or association rules, are represented as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and memberships in particular sets. On massive data sets, exact computation of such aggregates can be inefficient or infeasible, and therefore approximate queries are processed over sketches of the sets. Sketches based on coordinated random samples are scalable, flexible, and well suited for many applications. Queries are resolved by producing a sketch of the union of the sets used in the predicate from the sketches of these sets and then applying an estimator to this union-sketch. We derive novel, tighter (unbiased) estimators that leverage sampled keys that are present in the union of the applicable sketches but excluded from the union-sketch. We establish analytically that our estimators dominate estimators applied to the union-sketch for all queries and data sets. Empirical evaluation on synthetic and real data reveals that on typical applications we can expect a 25% to 4-fold reduction in estimation error.
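To make the baseline pipeline described above concrete, here is a minimal Python sketch of coordinated bottom-k sketches, the union-sketch built from them, and a simple Jaccard-similarity estimator applied to that union-sketch. It illustrates only the conventional union-sketch approach, not the tighter estimators the abstract announces. The function names, the SHA-256 based rank function, the seed string, and the choice of uniform (unweighted) keys are all assumptions made for illustration. The last helper lists the sampled keys that appear in some input sketch but are left out of the union-sketch, which is the information the abstract says the new estimators exploit.

```python
import hashlib


def rank(key, seed="shared-seed"):
    """Pseudo-random rank in [0, 1). Using the same seed for every set
    is what coordinates the samples (hypothetical hashing choice)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_k_sketch(keys, k):
    """Keep the k distinct keys with the smallest ranks."""
    return sorted(set(keys), key=rank)[:k]


def union_sketch(sketch_a, sketch_b, k):
    """Bottom-k sketch of A | B, computed from the two sketches alone."""
    return sorted(set(sketch_a) | set(sketch_b), key=rank)[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Fraction of union-sketch keys that belong to both sets.
    A key in the union-sketch is in A & B exactly when it appears
    in both input sketches."""
    union = union_sketch(sketch_a, sketch_b, k)
    if not union:
        return 0.0
    in_both = set(sketch_a) & set(sketch_b)
    return sum(1 for key in union if key in in_both) / len(union)


def discarded_keys(sketch_a, sketch_b, k):
    """Sampled keys present in some input sketch but excluded from
    the union-sketch; the baseline estimator ignores them."""
    kept = set(union_sketch(sketch_a, sketch_b, k))
    return (set(sketch_a) | set(sketch_b)) - kept


if __name__ == "__main__":
    k = 200
    a = range(0, 1000)       # true Jaccard similarity with b is 1/3
    b = range(500, 1500)
    sa, sb = bottom_k_sketch(a, k), bottom_k_sketch(b, k)
    print("estimated Jaccard:", estimate_jaccard(sa, sb, k))
    print("discarded sampled keys:", len(discarded_keys(sa, sb, k)))
```

Because both sketches use the same rank function, the k smallest-ranked keys of the union can be recovered from the two sketches without revisiting the original sets. Every sampled key beyond those k is dropped by the baseline, even though it still carries information about the sets.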