Word association norms, mutual information, and lexicography
Computational Linguistics
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Foundations of statistical natural language processing
Foundations of statistical natural language processing
Similarity estimation techniques from rounding algorithms
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
Database Systems: The Complete Book
Database Systems: The Complete Book
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Accurate methods for the statistics of surprise and coincidence
Computational Linguistics - Special issue on using large corpora: I
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Modern Applied Statistics with S
Modern Applied Statistics with S
Very sparse stable random projections for dimension reduction in lα (0
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Estimators and tail bounds for dimension reduction in lα (0
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Nonlinear estimators and tail bounds for dimension reduction in l1 using Cauchy random projections
COLT'07 Proceedings of the 20th annual conference on Learning theory
Towards a universal sketch for origin-destination network measurements
NPC'11 Proceedings of the 8th IFIP international conference on Network and parallel computing
b-bit minwise hashing in practice
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Hi-index | 0.01 |
We should not have to look at the entire corpus (e.g., the Web) to know if two words are associated or not. A powerful sampling technique called Sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies) and the size of the collection. Not unsurprisingly, computational work and statistical accuracy (variance or errors) depend on sampling rate, as will be shown both theoretically and empirically. Sampling methods become more and more important with larger and larger collections. At Web scale, sampling rates as low as 10-4 may suffice.