The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques that are designed for use within a flexible and scalable "synopsis warehouse" architecture. In this setting, incoming data is split into partitions and a synopsis is created for each partition; each synopsis can then be used to quickly estimate the number of DVs in its corresponding partition. By combining and extending a number of results in the literature, we obtain both appropriate synopses and novel DV estimators to use in conjunction with these synopses. Our synopses can be created in parallel, and can then be easily combined to yield synopses and DV estimates for arbitrary unions, intersections or differences of partitions. Our synopses can also handle deletions of individual partition elements. We use the theory of order statistics to show that our DV estimators are unbiased, and to establish moment formulas and sharp error bounds. Based on a novel limit theorem, we can exploit results due to Cohen in order to select synopsis sizes when initially designing the warehouse. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
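To make the per-partition synopsis and the union operation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a "k minimum hash values" style synopsis: each partition keeps the k smallest hash values of its elements, with hashes treated as uniform on (0, 1]. The estimator shown, (k - 1) / U_(k), where U_(k) is the k-th smallest hash value, is one estimator of this kind known from the literature; the abstract does not spell out the estimator's form, so its use here is an assumption. The helper names (`unit_hash`, `build_synopsis`, `union_synopsis`, `estimate_dv`) and the choice of SHA-256 are hypothetical. Intersections, differences, and deletions are not covered by this sketch.

```python
import hashlib


def unit_hash(value):
    """Map a value to a pseudo-uniform number in (0, 1] (hypothetical helper)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def build_synopsis(values, k):
    """Synopsis of one partition: its k smallest distinct hash values, sorted."""
    return sorted({unit_hash(v) for v in values})[:k]


def union_synopsis(a, b, k):
    """Synopsis of the union of two partitions, built from their synopses only."""
    return sorted(set(a) | set(b))[:k]


def estimate_dv(synopsis, k):
    """Estimate the number of distinct values summarized by a synopsis."""
    if len(synopsis) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(synopsis))
    # Assumed estimator: (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / synopsis[k - 1]


if __name__ == "__main__":
    k = 1024
    part_a = build_synopsis(range(0, 60_000), k)       # can be built in parallel
    part_b = build_synopsis(range(40_000, 100_000), k)
    merged = union_synopsis(part_a, part_b, k)
    print("estimated DVs in union:", round(estimate_dv(merged, k)))
```

Because the k smallest hash values of a union are always contained among the k smallest values of each input, merging two synopses gives the same result as building one synopsis over the combined data. This is what allows the per-partition synopses to be created independently and combined later.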