On synopses for distinct-value estimation under multiset operations

Authors:
Kevin Beyer;Peter J. Haas;Berthold Reinwald;Yannis Sismanis;Rainer Gemulla
Affiliations:
IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;Technische Universität Dresden, Dresden, Germany
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007

Abstract

The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
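
As a rough illustration of how per-partition synopses can be built independently and then combined for a union, the sketch below assumes a k-minimum-values style synopsis (the k smallest hash values seen in a partition) together with the estimator (k - 1) / U_(k), where U_(k) is the k-th smallest hash value. The abstract does not spell out the synopsis layout or the estimator, so the function names, the SHA-1 based hash, and the parameter k are assumptions made for illustration only. Intersections, differences, and deletions are not covered here.

```python
import hashlib


def hash01(value):
    """Map a value to a pseudo-random point in (0, 1] (illustrative hash choice)."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def kmv_synopsis(values, k):
    """Return the k smallest distinct hash values of a partition, sorted ascending."""
    return sorted({hash01(v) for v in values})[:k]


def union_synopsis(syn_a, syn_b, k):
    """Combine two synopses built with the same hash into one for the union."""
    return sorted(set(syn_a) | set(syn_b))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values from a synopsis of target size k."""
    if len(synopsis) < k:
        # Fewer than k hashes means every distinct value was retained.
        return float(len(synopsis))
    # Assumed estimator: (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / synopsis[k - 1]


if __name__ == "__main__":
    k = 1024
    part_a = list(range(0, 6000))      # hypothetical partition A
    part_b = list(range(4000, 10000))  # hypothetical partition B, overlapping A
    syn_a = kmv_synopsis(part_a, k)
    syn_b = kmv_synopsis(part_b, k)
    merged = union_synopsis(syn_a, syn_b, k)
    print("estimated distinct values in union:", round(estimate_dv(merged, k)))
```

In this sketch the merged synopsis is identical to the one that would have been built from the concatenated partitions directly, which is why the two partitions can be processed in parallel. A production version would maintain a bounded heap rather than hashing the whole partition into a set first.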