On synopses for distinct-value estimation under multiset operations
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for "compound" partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
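To make the idea of per-partition synopses concrete, here is a minimal sketch. The abstract does not say which synopsis is used, so this assumes a k-minimum-values style synopsis: hash every value, keep the k smallest distinct hashes, and estimate the DV count as (k - 1) divided by the k-th smallest hash normalised to (0, 1]. All function names, the hash choice, and the sample data are hypothetical. Only multiset union is shown; intersection, difference, and deletions need extra bookkeeping that is omitted here.

```python
import hashlib

HASH_RANGE = 2 ** 64  # hashes are treated as integers in [1, 2**64]

def hash_value(value):
    """Map a value to a pseudo-random integer in [1, HASH_RANGE] (illustrative helper)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") + 1

def build_synopsis(values, k):
    """Keep the k smallest distinct hash values seen in one partition."""
    return sorted({hash_value(v) for v in values})[:k]

def merge_union(syn_a, syn_b, k):
    """Synopsis of the union of two partitions, built from their synopses alone."""
    return sorted(set(syn_a) | set(syn_b))[:k]

def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a size-k synopsis."""
    if len(synopsis) < k:
        # Fewer than k distinct hashes: the synopsis holds all of them, so count exactly.
        return float(len(synopsis))
    kth_smallest = synopsis[k - 1] / HASH_RANGE  # normalised to (0, 1]
    return (k - 1) / kth_smallest

# Assumed example data: two overlapping partitions with 10,000 distinct values in total.
k = 256
part_a = range(0, 6000)
part_b = range(4000, 10000)

# The two synopses are independent of each other, so they could be built in parallel.
syn_a = build_synopsis(part_a, k)
syn_b = build_synopsis(part_b, k)

syn_union = merge_union(syn_a, syn_b, k)
print(round(estimate_dv(syn_union, k)))
```

The merged synopsis is identical to the one that would be built by scanning both partitions together, which is why the base synopses can be created separately and combined afterwards. Larger k gives a tighter estimate at the cost of a larger synopsis.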