Summarizing data using bottom-k sketches

Authors:
Edith Cohen;Haim Kaplan
Affiliations:
AT&T Labs-Research;Tel Aviv University
Venue:
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Year:
2007

Citing 18
Cited 23

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences
Making data structures persistent

Journal of Computer and System Sciences - 18th Annual ACM Symposium on Theory of Computing (STOC), May 28-30, 1986
Computational geometry: algorithms and applications

Computational geometry: algorithms and applications
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Mirror, mirror on the Web: a study of host pairs with replicated content

WWW '99 Proceedings of the eighth international conference on World Wide Web
A protocol-independent technique for eliminating redundant network traffic

Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication
Min-wise independent permutations

Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Finding Interesting Associations without Support Pruning

IEEE Transactions on Knowledge and Data Engineering
Maintaining time-decaying stream aggregates

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Efficient estimation algorithms for neighborhood variance and other moments

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Flow sampling under hard resource constraints

Proceedings of the joint international conference on Measurement and modeling of computer systems
Estimating arbitrary subset sums with few probes

Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Randomized incremental constructions of three-dimensional convex hulls and planar voronoi diagrams, and approximate range counting

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
The DLT priority sampling is essentially optimal

Proceedings of the thirty-eighth annual ACM symposium on Theory of computing
Computing separable functions via gossip

Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing
Spatially-decaying aggregation over a network

Journal of Computer and System Sciences
Bottom-k sketches: better and more efficient estimation of aggregates

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems

Algorithms and estimators for accurate summarization of internet traffic

Proceedings of the 7th ACM SIGCOMM conference on Internet measurement
Tighter estimation using bottom k sketches

Proceedings of the VLDB Endowment
Stream sampling for variance-optimal estimation of subset sums

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Optimized union of non-disjoint distributed data sets

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Leveraging discarded samples for tighter estimation of multiple-set aggregates

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Optimal sampling from sliding windows

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Composable, scalable, and accurate weight summarization of unaggregated data sets

Proceedings of the VLDB Endowment
Coordinated weighted sampling for estimating aggregates over multiple weight assignments

Proceedings of the VLDB Endowment
Exponential time improvement for min-wise based algorithms

Information and Computation
A comparison of three algorithms for approximating the distance distribution in real-world graphs

TAPAS'11 Proceedings of the First international ICST conference on Theory and practice of algorithms in (computer) systems
Get the most out of your sample: optimal unbiased estimators using partial information

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Towards a universal sketch for origin-destination network measurements

NPC'11 Proceedings of the 8th IFIP international conference on Network and parallel computing
Optimal sampling from sliding windows

Journal of Computer and System Sciences
Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums

SIAM Journal on Computing
Exponential time improvement for min-wise based algorithms

Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
CRSI: a compact randomized similarity index for set-valued features

Proceedings of the 15th International Conference on Extending Database Technology
Don't let the negatives bring you down: sampling from streams of signed updates

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases
Statistical distortion: consequences of data cleaning

Proceedings of the VLDB Endowment
Bottom-k and priority sampling, set similarity and subset sums with minimal independence

Proceedings of the forty-fifth annual ACM symposium on Theory of computing
Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths

Proceedings of the first ACM conference on Online social networks
Call me maybe: understanding nature and risks of sharing mobile numbers on online social networks

Proceedings of the first ACM conference on Online social networks

Quantified Score

Hi-index	0.00

Visualization

Abstract

A Bottom-sketch is a summary of a set of items with nonnegative weights that supports approximate query processing. A sketch is obtained by associating with each item in a ground set an independent random rank drawn from a probability distribution that depends on the weight of the item and including the k items with smallest rank value. Bottom-k sketches are an alternative to k-mins sketches[9], which consist of the k minimum ranked items in k independent rank assignments,and of min-hash [5] sketches, where hash functions replace random rank assignments. Sketches support approximate aggregations, including weight and selectivity of a subpopulation. Coordinated sketches of multiple subsets over the same ground set support subset-relation queries such as Jaccard similarity or the weight of the union. All-distances sketches are applicable for datasets where items lie in some metric space such as data streams (time) or networks. These sketches compactly encode the respective plain sketches of all neighborhoods of a location. These sketches support queries posed over time windows or neighborhoods and time/spatially decaying aggregates. An important advantage of bottom-k sketches, established in a line of recent work, is much tighter estimators for several basic aggregates. To materialize this benefit, we must adapt traditional k-mins applications to use bottom-k sketches. We propose all-distances bottom-k sketches and develop and analyze data structures that incrementally construct bottom-k sketches and all-distances bottom-k sketches. Another advantage of bottom-k sketches is that when the data is represented explicitly, they can be obtained much more efficiently than k-mins sketches. We show that k-mins sketches can be derived from respective bottom-k sketches, which enables the use of bottom-k sketches with off-the-shelf k-mins estimators. (In fact, we obtain tighter estimators since each bottom-k sketch is adistribution over k-mins sketches).