In this paper, we propose a memory-, space-, and time-efficient framework for scaling distributional similarity to the web. We exploit sketch techniques, in particular the Count-Min (CM) sketch, which approximates the frequency of an item in the corpus without explicitly storing the item itself. These methods use hashing to handle massive amounts of streaming text. We store all item counts computed from 90 GB of web data in just 2 billion counters (8 GB of main memory) of a CM sketch. Our method returns the semantic similarity between a pair of words in O(K) time and can compute similarity for any word pair stored in the sketch. Our experiments show that the framework is as effective as using exact counts.
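To make the counting idea concrete, here is a minimal Count-Min sketch in Python. It is an illustrative sketch under stated assumptions, not the implementation used in the paper: the class name `CountMinSketch`, the `depth` and `width` parameters, and the choice of salted BLAKE2b as the hash family are all hypothetical choices made for this example. It shows the two properties the abstract relies on: items are never stored, only hashed into counters, and an estimate is the minimum over one counter per row.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: a depth x width table of counters."""

    def __init__(self, depth=4, width=1024):
        # depth = number of hash functions (rows); width = counters per row.
        # Both values are assumptions for this example.
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the
        # row number (an assumed hash family, chosen for convenience).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Increment one counter in every row; the item itself is not stored.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def query(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch(depth=4, width=64)
    for token in ["car", "automobile", "car", "road", "car"]:
        cms.update(token)
    print(cms.query("car"))
```

Each update and each lookup touches exactly one counter per row, so the cost depends on the number of hash functions rather than on the size of the corpus. The figures in the abstract (2 billion counters in 8 GB) correspond to 4 bytes per counter; the Python lists above are far less compact and serve only to show the logic.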