Rules of thumb for information acquisition from large and redundant data
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
We develop an abstract model of information acquisition from redundant data. We assume a random sampling process over data that contain information with bias, and we are interested in the fraction of information we expect to learn as a function of (i) the sampled fraction (recall) and (ii) the varying bias of the information (redundancy distributions). We develop two rules of thumb of varying robustness. We first show that, when information bias follows a Zipf distribution, the 80-20 rule, or Pareto principle, surprisingly does not hold: we rather expect to learn less than 40% of the information when randomly sampling 20% of the overall data. We then analytically prove that, for large data sets, randomized sampling from power-law distributions leads to "truncated distributions" with the same power-law exponent. This second rule is very robust and also holds for distributions that deviate substantially from a strict power law. We further give one particular family of power-law functions that remains completely invariant under sampling. Finally, we validate our model on two large Web data sets: link distributions to web domains and tag distributions on delicious.com.
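The first rule of thumb can be illustrated with a small Monte Carlo sketch. The following is not the paper's derivation but a minimal simulation under one concrete assumption: k distinct facts whose redundancy follows a Zipf distribution with exponent 1 (fact i appears roughly k/i times, with a minimum count of 1). The exact learned fraction depends on the exponent and the minimum redundancy; the point of the sketch is only that sampling 20% of the data yields far less than the 80% of distinct facts the Pareto principle would suggest.

```python
import random

def zipf_population(k, min_count=1):
    """Build a data set over k distinct facts where fact i (1-indexed)
    appears ~ min_count * k / i times: a Zipf-like redundancy
    distribution with exponent 1 (an illustrative assumption)."""
    population = []
    for i in range(1, k + 1):
        population.extend([i] * max(1, round(min_count * k / i)))
    return population

def learned_fraction(population, recall, rng):
    """Sample a fraction `recall` of the data uniformly without
    replacement and return the fraction of distinct facts seen."""
    n = round(recall * len(population))
    sample = rng.sample(population, n)
    return len(set(sample)) / len(set(population))

rng = random.Random(0)
pop = zipf_population(k=2000)
frac = learned_fraction(pop, recall=0.2, rng=rng)
print(f"sampled 20% of the data, learned {frac:.0%} of the facts")
```

With these parameters the learned fraction comes out well below 80%, in qualitative agreement with the abstract's sub-40% figure for the asymptotic Zipf case.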