Sketching techniques for large scale NLP

  • Authors:
  • Amit Goyal;Jagadeesh Jagarlamudi;Hal Daumé, III;Suresh Venkatasubramanian

  • Affiliations:
  • University of Utah;University of Utah;University of Utah;University of Utah

  • Venue:
  • WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
  • Year:
  • 2010

Abstract

In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in the context of streaming data. We explore sketch techniques, especially the Count-Min Sketch, which approximates the frequency of a word pair in the corpus without explicitly storing the word pairs themselves. We use the idea of a conservative update with the Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store the counts of all words and word pairs computed from 37 GB of web data in just 2 billion counters (8 GB of RAM) — up to 30 times fewer counters than the size of the stream, a substantial saving in memory. In Semantic Orientation experiments, PMI scores computed from the 2 billion counters are as effective as exact PMI scores.
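The two ideas the abstract relies on — the Count-Min Sketch and its conservative-update variant — can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the class name, the width/depth parameters, and the tuple-based hashing are illustrative assumptions; a real deployment at the paper's scale (2 billion counters) would use fixed-width integer arrays and faster hash functions.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical, not the paper's code).

    A depth x width grid of counters; each of the `depth` rows has its own
    hash function. Updates touch one counter per row; a query returns the
    minimum over the touched counters, which never underestimates the
    true count.
    """

    def __init__(self, width=1000, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.counts = [[0] * width for _ in range(depth)]
        # One hash seed per row; hashing a (seed, item) tuple gives
        # different bucket choices in each row.
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]

    def _buckets(self, item):
        return [hash((s, item)) % self.width for s in self.seeds]

    def update(self, item, count=1, conservative=True):
        buckets = self._buckets(item)
        if conservative:
            # Conservative update: raise each counter only as far as
            # (current estimate + count), never beyond. This keeps the
            # no-underestimate guarantee while inflating counters less,
            # which is the source of the error reduction the paper reports.
            est = min(self.counts[d][i] for d, i in enumerate(buckets))
            for d, i in enumerate(buckets):
                self.counts[d][i] = max(self.counts[d][i], est + count)
        else:
            for d, i in enumerate(buckets):
                self.counts[d][i] += count

    def query(self, item):
        return min(self.counts[d][i] for d, i in enumerate(self._buckets(item)))
```

For example, after five calls to `update("the dog")`, `query("the dog")` returns a value of at least 5 (exactly 5 unless other items collided in every row). Because conservative update only raises the counters it must, its counters are pointwise no larger than a plain sketch's over the same stream, so its estimates can only be tighter.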