Sketch techniques for scaling distributional similarity to the web

Authors:
Amit Goyal;Jagadeesh Jagarlamudi;Hal Daumé, III;Suresh Venkatasubramanian
Affiliations:
University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT;University of Utah, Salt Lake City, UT
Venue:
GEMS '10 Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics
Year:
2010

Citing 14
Cited 2

Contextual correlates of synonymy

Communications of the ACM
Placing search in context: the concept revisited

ACM Transactions on Information Systems (TOIS)
New directions in traffic measurement and accounting

Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications
An improved data stream summary: the count-min sketch and its applications

Journal of Algorithms
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Statistical analysis of sketch estimators

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Finding frequent items in data streams

Proceedings of the VLDB Endowment
A uniform approach to analogies, synonyms, antonyms, and associations

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
A study on similarity and relatedness using distributional and WordNet-based approaches

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Streaming for large scale NLP: language modeling

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Probabilistic counting with randomized storage

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Stream-based randomised language models for SMT

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Web-scale distributional similarity and entity set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Sketching techniques for large scale NLP

WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop

Efficient online locality sensitive hashing via reservoir counting

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Approximate scalable bounded space sketch for large data NLP

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we propose a memory, space, and time efficient framework to scale distributional similarity to the web. We exploit sketch techniques, especially the Count-Min sketch, which approximates the frequency of an item in the corpus without explicitly storing the item itself. These methods use hashing to deal with massive amounts of the streaming text. We store all item counts computed from 90 GB of web data in just 2 billion counters (8 GB main memory) of CM sketch. Our method returns semantic similarity between word pairs in O(K) time and can compute similarity between any word pairs that are stored in the sketch. In our experiments, we show that our framework is as effective as using the exact counts.