Towards estimation error guarantees for distinct values

Authors:
Moses Charikar;Surajit Chaudhuri;Rajeev Motwani;Vivek Narasayya
Affiliations:
Department of Computer Science, Gates 4B, Stanford University, Stanford, CA;Microsoft Research, One Microsoft Way, Redmond, WA;Department of Computer Science, Gates 4B, Stanford University, Stanford, CA;Microsoft Research, One Microsoft Way, Redmond, WA
Venue:
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2000

Abstract

We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicates their effectiveness both on real data sets and on synthetic data sets.
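
The abstract does not state the estimator's formula, so the following is only an illustrative sketch of sampling-based distinct-value estimation. It assumes the form usually attributed to this paper's guaranteed-error estimator, sqrt(n/r) * f1 + (number of values seen at least twice), where n is the table size, r is the sample size, and f1 is the number of values that occur exactly once in the sample. The function name `gee_estimate` and the synthetic column in the demo are hypothetical.

```python
import math
import random
from collections import Counter


def gee_estimate(sample, n):
    """Estimate the number of distinct values in a column of n rows.

    `sample` is assumed to be a uniform random sample of the column.
    Assumed formula (not given in the abstract):
        sqrt(n / r) * f1 + (number of values seen two or more times)
    """
    r = len(sample)
    if r == 0:
        raise ValueError("sample must be non-empty")
    if r > n:
        raise ValueError("sample cannot be larger than the table")

    counts = Counter(sample)
    # f1: values that appear exactly once in the sample.
    f1 = sum(1 for c in counts.values() if c == 1)
    # Values seen at least twice are counted once each, without scaling.
    repeated = len(counts) - f1
    return math.sqrt(n / r) * f1 + repeated


if __name__ == "__main__":
    # Hypothetical synthetic column: 100,000 rows drawn from 5,000 values.
    rng = random.Random(0)
    column = [rng.randrange(5000) for _ in range(100_000)]
    sample = rng.sample(column, 1000)
    print("true distinct:", len(set(column)))
    print("estimate:     ", round(gee_estimate(sample, len(column))))
```

Only the singletons are scaled up, because a value seen once in the sample may stand for anywhere between one and roughly n/r rows in the table; the square root is the geometric mean of those two extremes. When the sample is the whole table (r = n), the estimate reduces to the exact distinct count.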