Distinct-value synopses for multiset operations

Authors:
Kevin Beyer;Rainer Gemulla;Peter J. Haas;Berthold Reinwald;Yannis Sismanis
Affiliations:
IBM Almaden Research Center, San Jose, CA.;IBM Almaden Research Center, San Jose, CA.;IBM Almaden Research Center, San Jose, CA.;IBM Almaden Research Center, San Jose, CA.;IBM Almaden Research Center, San Jose, CA.
Venue:
Communications of the ACM - A View of Parallel Computing
Year:
2009

References: 17
Cited by: 6

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences
Approximating the number of unique values of an attribute without sorting

Information Systems
A linear-time probabilistic counting algorithm for database applications

ACM Transactions on Database Systems (TODS)
Randomized algorithms

Randomized algorithms
Mining database structure; or, how to build a data quality browser

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Counting Distinct Elements in a Data Stream

RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Tracking set-expression cardinalities over continuous update streams

The VLDB Journal — The International Journal on Very Large Data Bases
Techniques for Warehousing of Sample Data

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
The DLT priority sampling is essentially optimal

Proceedings of the thirty-eighth annual ACM symposium on Theory of computing
On synopses for distinct-value estimation under multiset operations

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Priority sampling for estimation of arbitrary subset sums

Journal of the ACM (JACM)
Sampling time-based sliding windows in bounded space

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Exploiting correlated keywords to improve approximate information filtering

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Hashed samples: selectivity estimators for set similarity selection queries

Proceedings of the VLDB Endowment
Tighter estimation using bottom k sketches

Proceedings of the VLDB Endowment
Multidimensional content eXploration

Proceedings of the VLDB Endowment

ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
On power-law distributed balls in bins and its applications to view size estimation

ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
Towards benefit-based RDF source selection for SPARQL queries

SWIM '12 Proceedings of the 4th International Workshop on Semantic Web Information Management
Sketching and streaming algorithms for processing massive data

XRDS: Crossroads, The ACM Magazine for Students - Big Data
Faster upper bounding of intersection sizes

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Scalable, continuous tracking of tag co-occurrences between short sets using (almost) disjoint tag partitions

Proceedings of the ACM SIGMOD Workshop on Databases and Social Networks

Abstract

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
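As an illustration of the general idea, here is a minimal sketch of a per-partition synopsis that can be merged under union. It assumes a k-minimum-values (KMV) style synopsis, which keeps the k smallest hash values of a partition and estimates the DV count as (k - 1) divided by the k-th smallest hash. The abstract does not spell out the construction, so the function names, the SHA-256 based hash, and the choice of k are all assumptions made for illustration. The sketch covers union only; intersection, difference, and deletions are not shown.

```python
import hashlib


def unit_hash(value):
    """Map a value to a pseudo-random float in [0, 1) (hash choice is illustrative)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_synopsis(values, k):
    """Keep the k smallest distinct hash values seen in one partition."""
    return sorted(set(unit_hash(v) for v in values))[:k]


def union_synopsis(a, b, k):
    """Synopsis of the union: the k smallest hashes across both synopses."""
    return sorted(set(a) | set(b))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values; assumes k >= 2."""
    if len(synopsis) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(synopsis))
    return (k - 1) / synopsis[k - 1]


# Two overlapping partitions; each synopsis could be built on a separate worker.
k = 1024
part_a = build_synopsis(range(0, 6000), k)
part_b = build_synopsis(range(4000, 10000), k)
merged = union_synopsis(part_a, part_b, k)
print(round(estimate_dv(merged, k)))  # an estimate of the 10000 distinct values
```

Because each synopsis depends only on its own partition, the synopses can be built independently and merged afterwards without rescanning the data. Duplicate values hash to the same number, so the estimate is insensitive to how often a value repeats. Larger k gives a more accurate estimate at the cost of a larger synopsis.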