Towards estimation error guarantees for distinct values

  • Authors:
  • Moses Charikar;Surajit Chaudhuri;Rajeev Motwani;Vivek Narasayya

  • Affiliations:
  • Department of Computer Science, Gates 4B, Stanford University, Stanford, CA;Microsoft Research, One Microsoft Way, Redmond, WA;Department of Computer Science, Gates 4B, Stanford University, Stanford, CA;Microsoft Research, One Microsoft Way, Redmond, WA

  • Venue:
  • PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
  • Year:
  • 2000

Quantified Score

Hi-index 0.00

Visualization

Abstract

We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.