Rules of thumb for information acquisition from large and redundant data
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
We develop an abstract model of information acquisition from redundant data. We assume a random sampling process over data that contain information with bias, and we are interested in the fraction of information we expect to learn as a function of (i) the sampled fraction (recall) and (ii) the varying bias of the information (redundancy distributions). We develop two rules of thumb of varying robustness. We first show that, when information bias follows a Zipf distribution, the 80-20 rule, or Pareto principle, surprisingly does not hold: we rather expect to learn less than 40% of the information when randomly sampling 20% of the overall data. We then analytically prove that, for large data sets, randomized sampling from power-law distributions leads to "truncated distributions" with the same power-law exponent. This second rule is very robust and also holds for distributions that deviate substantially from a strict power law. We further give one particular family of power-law functions that remains completely invariant under sampling. Finally, we validate our model on two large Web data sets: link distributions to web domains and tag distributions on delicious.com.
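The first rule of thumb can be illustrated with a small Monte Carlo sketch. The following is not the paper's derivation but a minimal simulation under one concrete assumption: k distinct facts whose redundancy follows a Zipf distribution with exponent 1 (fact i appears roughly k/i times, with a minimum count of 1). The exact learned fraction depends on the exponent and the minimum redundancy; the point of the sketch is only that sampling 20% of the data yields far less than the 80% of distinct facts the Pareto principle would suggest.

```python
import random

def zipf_population(k, min_count=1):
    """Build a data set over k distinct facts where fact i (1-indexed)
    appears ~ min_count * k / i times: a Zipf-like redundancy
    distribution with exponent 1 (an illustrative assumption)."""
    population = []
    for i in range(1, k + 1):
        population.extend([i] * max(1, round(min_count * k / i)))
    return population

def learned_fraction(population, recall, rng):
    """Sample a fraction `recall` of the data uniformly without
    replacement and return the fraction of distinct facts seen."""
    n = round(recall * len(population))
    sample = rng.sample(population, n)
    return len(set(sample)) / len(set(population))

rng = random.Random(0)
pop = zipf_population(k=2000)
frac = learned_fraction(pop, recall=0.2, rng=rng)
print(f"sampled 20% of the data, learned {frac:.0%} of the facts")
```

With these parameters the learned fraction comes out well below 80%, in qualitative agreement with the abstract's sub-40% figure for the asymptotic Zipf case.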