Approximating the number of unique values of an attribute without sorting
Information Systems
Physical database design for relational databases
ACM Transactions on Database Systems (TODS)
Processing aggregate relational queries with hard time constraints
SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
A linear-time probabilistic counting algorithm for database applications
ACM Transactions on Database Systems (TODS)
On estimating the size of projections
ICDT '90 Proceedings of the third international conference on database theory on Database theory
Information Processing and Management: an International Journal - Special issue on Informetrics
Randomized algorithms
Improved histograms for selectivity estimation of range predicates
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Random sampling for histogram construction: how much is enough?
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Statistical estimators for relational algebra expressions
Proceedings of the seventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Sampling-Based Estimation of the Number of Distinct Values of an Attribute
VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Optimal and approximate computation of summary statistics for range aggregates
PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Modeling high-dimensional index structures using sampling
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Estimating simple functions on the union of data streams
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Sampling algorithms: lower bounds and applications
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Models and issues in data stream systems
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Automating Statistics Management for Query Optimizers
IEEE Transactions on Knowledge and Data Engineering
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes
Proceedings of the 27th International Conference on Very Large Data Bases
Clustering Data Streams: Theory and Practice
IEEE Transactions on Knowledge and Data Engineering
Comparing Data Streams Using Hamming Norms (How to Zero In)
IEEE Transactions on Knowledge and Data Engineering
A Pareto model for OLAP view size estimation
CASCON '01 Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research
Processing set expressions over continuous update streams
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
SIA: secure information aggregation in sensor networks
Proceedings of the 1st international conference on Embedded networked sensor systems
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Modeling correlations in web traces and implications for designing replacement policies
Computer Networks: The International Journal of Computer and Telecommunications Networking
Tracking set-expression cardinalities over continuous update streams
The VLDB Journal — The International Journal on Very Large Data Bases
Maintaining Implicated Statistics in Constrained Environments
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Join-distinct aggregate estimation over update streams
Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Relational confidence bounds are easy with the bootstrap
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Towards estimating the number of distinct value combinations for a set of attributes
Proceedings of the 14th ACM international conference on Information and knowledge management
Estimating nested selectivity in object-oriented and object-relational databases
Information and Software Technology
Cardinality estimation using sample views with quality assurance
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
On synopses for distinct-value estimation under multiset operations
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Efficient Approximate Query Processing in Peer-to-Peer Networks
IEEE Transactions on Knowledge and Data Engineering
Comparing data streams using Hamming norms (how to zero in)
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Estimating the output cardinality of partial preaggregation with a measure of clusteredness
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Analytic-based estimation of query result sizes
AIKED'05 Proceedings of the 4th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering Data Bases
Processing top k queries from samples
CoNEXT '06 Proceedings of the 2006 ACM CoNEXT conference
SIA: Secure information aggregation in sensor networks
Journal of Computer Security - Special Issue on Security of Ad-hoc and Sensor Networks
Testing symmetric properties of distributions
STOC '08 Proceedings of the fortieth annual ACM symposium on Theory of computing
Sampling time-based sliding windows in bounded space
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Efficient and scalable statistics gathering for large databases in Oracle 11g
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Confidence bounds for sampling-based group by estimates
ACM Transactions on Database Systems (TODS)
Distinct value estimation on peer-to-peer networks
Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments
Tagmark: reliable estimations of RFID tags for business processes
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Processing top-k queries from samples
Computer Networks: The International Journal of Computer and Telecommunications Networking
Efficiently Handling Dynamics in Distributed Link Based Authority Analysis
WISE '08 Proceedings of the 9th international conference on Web Information Systems Engineering
Hashed samples: selectivity estimators for set similarity selection queries
Proceedings of the VLDB Endowment
Efficiently approximating query optimizer plan diagrams
Proceedings of the VLDB Endowment
Sublinear Algorithms for Approximating String Compressibility
APPROX '07/RANDOM '07 Proceedings of the 10th International Workshop on Approximation and the 11th International Workshop on Randomization, and Combinatorial Optimization. Algorithms and Techniques
The design of a query monitoring system
ACM Transactions on Database Systems (TODS)
The average-case complexity of counting distinct elements
Proceedings of the 12th International Conference on Database Theory
Sampling-based estimators for subset-based queries
The VLDB Journal — The International Journal on Very Large Data Bases
Leveraging discarded samples for tighter estimation of multiple-set aggregates
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Preventing bad plans by bounding the impact of cardinality estimation errors
Proceedings of the VLDB Endowment
Correlation maps: a compressed access method for exploiting soft functional dependencies
Proceedings of the VLDB Endowment
Histograms reloaded: the merits of bucket diversity
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
CORADD: correlation aware database designer for materialized views and indexes
Proceedings of the VLDB Endowment
HADI: Mining Radii of Large Graphs
ACM Transactions on Knowledge Discovery from Data (TKDD)
Proceedings of the forty-third annual ACM symposium on Theory of computing
Compression aware physical database design
Proceedings of the VLDB Endowment
On approximation algorithms for data mining applications
Efficient Approximation and Online Algorithms
On power-law distributed balls in bins and its applications to view size estimation
ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
Space-efficient estimation of statistics over sub-sampled streams
PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
Sort-sharing-aware query processing
The VLDB Journal — The International Journal on Very Large Data Bases
Testing Symmetric Properties of Distributions
SIAM Journal on Computing
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Efficient XQuery evaluation of grouping conditions with duplicate removals
XSym'07 Proceedings of the 5th international conference on Database and XML Technologies
Estimating sum by weighted sampling
ICALP'07 Proceedings of the 34th international conference on Automata, Languages and Programming
Estimating duplication by content-based sampling
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Hi-index | 0.00 |
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.