We should not have to look at the entire corpus (e.g., the Web) to know whether two (or more) words are strongly associated. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population by straightforward scaling. However, one can do better by taking advantage of the margins (also known as document frequencies). The proposed method cuts the errors roughly in half compared with Broder's sketches.
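The two estimation strategies mentioned above can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes a plain uniform sample of document ids instead of the sketch construction, and it uses the margins in one plausible way: a brute-force maximum-likelihood search under sampling without replacement. All function and parameter names (`sample_table`, `scale_up`, `margin_mle`, `f_x`, `f_y`) are hypothetical.

```python
from math import lgamma, inf


def log_choose(n, k):
    """Log of the binomial coefficient C(n, k); -inf when it is zero."""
    if k < 0 or k > n:
        return -inf
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def sample_table(postings_x, postings_y, sampled_docs):
    """2x2 contingency counts over the sampled document ids.

    Returns (a_s, b_s, c_s, d_s): both words, x only, y only, neither.
    """
    x = postings_x & sampled_docs
    y = postings_y & sampled_docs
    a_s = len(x & y)
    b_s = len(x) - a_s
    c_s = len(y) - a_s
    d_s = len(sampled_docs) - a_s - b_s - c_s
    return a_s, b_s, c_s, d_s


def scale_up(a_s, n_sample, n_total):
    """Baseline: scale the sample co-occurrence count to the population."""
    return a_s * n_total / n_sample


def margin_mle(table, f_x, f_y, n_total):
    """Margin-aware estimate of the co-occurrence count a.

    Assumption (not from the source): documents are sampled without
    replacement, so given the margins f_x and f_y the sample table is
    multivariate hypergeometric with population cells
    (a, f_x - a, f_y - a, n_total - f_x - f_y + a).
    We pick the integer a with the highest likelihood by brute force.
    """
    a_s, b_s, c_s, d_s = table
    best_a, best_ll = None, -inf
    for a in range(max(0, f_x + f_y - n_total), min(f_x, f_y) + 1):
        ll = (log_choose(a, a_s)
              + log_choose(f_x - a, b_s)
              + log_choose(f_y - a, c_s)
              + log_choose(n_total - f_x - f_y + a, d_s))
        if ll > best_ll:
            best_a, best_ll = a, ll
    return best_a


if __name__ == "__main__":
    # Toy corpus of 1000 documents; every 10th document is sampled.
    n_total = 1000
    postings_x = set(range(0, 200))    # documents containing word x
    postings_y = set(range(100, 400))  # documents containing word y
    sampled = set(range(0, n_total, 10))

    table = sample_table(postings_x, postings_y, sampled)
    print("sample table:", table)
    print("scaled estimate:", scale_up(table[0], len(sampled), n_total))
    print("margin-aware estimate:",
          margin_mle(table, len(postings_x), len(postings_y), n_total))
```

The brute-force search is linear in the smaller document frequency, which is fine for illustration but would need a closed-form or iterative solver at Web scale.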