Using sketches to estimate associations

Authors:
Ping Li; Kenneth W. Church
Affiliations:
Stanford University, Stanford, California; Microsoft Research, Redmond, Washington
Venue:
HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Year:
2005

Abstract

We should not have to look at the entire corpus (e.g., the Web) to know whether two words are associated. A powerful sampling technique called sketches was originally introduced to remove duplicate Web pages. We generalize sketches to estimate contingency tables and associations, using a maximum likelihood estimator to find the most likely contingency table given the sample, the margins (document frequencies), and the size of the collection. Not surprisingly, computational work and statistical accuracy (variance or errors) depend on the sampling rate, as will be shown both theoretically and empirically. Sampling methods become increasingly important as collections grow. At Web scale, sampling rates as low as 10^-4 may suffice.
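
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names (build_sketch, sample_table, mle_a), the use of 1-based permuted document IDs, the rule that the sample consists of all IDs up to the smaller of the two sketches' largest IDs, and the with-replacement (multinomial) approximation to the likelihood. It also assumes each word occurs in more than k documents, so that its sketch is a genuine truncation of its postings list. Given the sample counts, the two document frequencies f1 and f2, and the collection size D, it searches for the co-occurrence count a that maximizes the approximate likelihood.

```python
import math
import random


def build_sketch(postings, perm, k):
    """Keep the k smallest permuted document IDs for one word (sorted)."""
    return sorted(perm[d] for d in postings)[:k]


def sample_table(sk1, sk2):
    """Sample contingency counts (a_s, b_s, c_s, d_s) from two sketches.

    Assumption: permuted IDs 1..ds form the sample, where ds is the
    smaller of the two sketches' largest IDs.
    """
    ds = min(sk1[-1], sk2[-1])
    s1 = {i for i in sk1 if i <= ds}
    s2 = {i for i in sk2 if i <= ds}
    a_s = len(s1 & s2)            # sampled docs containing both words
    b_s = len(s1) - a_s           # word 1 only
    c_s = len(s2) - a_s           # word 2 only
    d_s = ds - a_s - b_s - c_s    # neither word
    return a_s, b_s, c_s, d_s


def _term(count, value):
    """count * log(value), treating 0 * log(0) as 0."""
    if count == 0:
        return 0.0
    if value <= 0:
        return -math.inf
    return count * math.log(value)


def mle_a(a_s, b_s, c_s, d_s, f1, f2, D):
    """Most likely co-occurrence count a, given the sample and the margins.

    Assumption: multinomial (with-replacement) approximation, so the
    log-likelihood is a sum of count * log(cell size) terms. A linear
    scan over feasible a is used for clarity, not speed.
    """
    lo, hi = max(0, f1 + f2 - D), min(f1, f2)

    def loglik(a):
        return (_term(a_s, a) + _term(b_s, f1 - a)
                + _term(c_s, f2 - a) + _term(d_s, D - f1 - f2 + a))

    return max(range(lo, hi + 1), key=loglik)


if __name__ == "__main__":
    random.seed(0)
    D = 10_000
    ids = list(range(1, D + 1))
    random.shuffle(ids)
    perm = dict(zip(range(D), ids))   # document -> permuted ID in 1..D

    # Two synthetic, correlated postings lists (hypothetical data).
    w1 = [d for d in range(D) if random.random() < 0.10]
    w2 = sorted({d for d in w1 if random.random() < 0.5}
                | {d for d in range(D) if random.random() < 0.05})

    table = sample_table(build_sketch(w1, perm, 200),
                         build_sketch(w2, perm, 200))
    estimate = mle_a(*table, len(w1), len(w2), D)
    print("sample table:", table)
    print("estimated a:", estimate, "true a:", len(set(w1) & set(w2)))
```

Because the margins f1 and f2 are known exactly, only one free parameter remains in the 2x2 table, which is why a one-dimensional search suffices here. Smaller k means less work per pair but a noisier estimate, in line with the trade-off the abstract describes.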