View size estimation plays an important role in query optimization, and many data sets have been observed to follow a power law distribution. In this paper, we consider the balls-in-bins problem in which balls are placed into N bins and the bin selection probabilities follow a power law distribution. As a generalization of the coupon collector's problem, we address the problem of determining the expected number of balls that must be thrown so that each of the N bins contains at least one ball. We prove that $\Theta(\frac{N^\alpha \ln N}{c_N^{\alpha}})$ balls are needed, where α is the parameter of the power law distribution and $c_N^{\alpha}=\frac{\alpha-1}{\alpha-N^{\alpha-1}}$ for α≠1 and $c_N^{\alpha}=\frac{1}{\ln N}$ for α=1. Next, fixing the number of balls thrown to T, we provide closed-form upper and lower bounds on the expected number of bins that have at least one occupant. For n large and α>1, we prove that our bounds are tight up to a constant factor of $\left(\frac{\alpha}{\alpha-1}\right)^{1-\frac{1}{\alpha}} \leq e^{1/e} \simeq 1.4$.
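As a rough illustration of the quantities discussed in the abstract, the following Python sketch computes the exact expected number of occupied bins after T independent throws, and evaluates the constant factor $\left(\frac{\alpha}{\alpha-1}\right)^{1-\frac{1}{\alpha}}$. It assumes that bin i is selected with probability proportional to $i^{-\alpha}$; the abstract does not spell out this parameterization, so treat it as an assumption. The function names are hypothetical, and the code does not implement the paper's closed-form bounds.

```python
import math


def power_law_probs(n_bins, alpha):
    """Selection probabilities p_i proportional to i**(-alpha), i = 1..n_bins.

    Assumption: this is one common power law parameterization; the
    abstract does not state the exact form it uses.
    """
    weights = [i ** (-alpha) for i in range(1, n_bins + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_occupied(n_bins, alpha, n_balls):
    """Exact expected number of non-empty bins after n_balls independent throws.

    By linearity of expectation, bin i is non-empty with probability
    1 - (1 - p_i)**n_balls, and the expectation is the sum over bins.
    """
    probs = power_law_probs(n_bins, alpha)
    return sum(1.0 - (1.0 - p) ** n_balls for p in probs)


def tightness_factor(alpha):
    """The constant (alpha / (alpha - 1)) ** (1 - 1/alpha), defined for alpha > 1."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return (alpha / (alpha - 1.0)) ** (1.0 - 1.0 / alpha)


if __name__ == "__main__":
    # Example values chosen only for illustration.
    print(expected_occupied(n_bins=1000, alpha=1.5, n_balls=5000))
    print(tightness_factor(2.0), math.exp(1.0 / math.e))
```

Writing $x = 1 - \frac{1}{\alpha}$, the factor equals $(1/x)^x$, which is maximized at $x = 1/e$; this is where the bound $e^{1/e}$ comes from.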