Sketching techniques for large scale NLP

  • Authors:
  • Amit Goyal;Jagadeesh Jagarlamudi;Hal Daumé, III;Suresh Venkatasubramanian

  • Affiliations:
  • University of Utah;University of Utah;University of Utah;University of Utah

  • Venue:
  • WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
  • Year:
  • 2010

Abstract

In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in the context of streaming data. We explore sketch techniques, especially the Count-Min Sketch, which approximates the frequency of a word pair in the corpus without explicitly storing the word pairs themselves. We use the idea of a conservative update with the Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store the counts of all words and word pairs computed from 37 GB of web data in just 2 billion counters (8 GB of RAM) — up to 30 times fewer counters than the size of the stream, a substantial saving in memory. In Semantic Orientation experiments, PMI scores computed from the 2 billion counters are as effective as exact PMI scores.
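The two ideas the abstract relies on — the Count-Min Sketch and its conservative-update variant — can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the class name, the width/depth parameters, and the tuple-based hashing are illustrative assumptions; a real deployment at the paper's scale (2 billion counters) would use fixed-width integer arrays and faster hash functions.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min Sketch (hypothetical, not the paper's code).

    A depth x width grid of counters; each of the `depth` rows has its own
    hash function. Updates touch one counter per row; a query returns the
    minimum over the touched counters, which never underestimates the
    true count.
    """

    def __init__(self, width=1000, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.counts = [[0] * width for _ in range(depth)]
        # One hash seed per row; hashing a (seed, item) tuple gives
        # different bucket choices in each row.
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]

    def _buckets(self, item):
        return [hash((s, item)) % self.width for s in self.seeds]

    def update(self, item, count=1, conservative=True):
        buckets = self._buckets(item)
        if conservative:
            # Conservative update: raise each counter only as far as
            # (current estimate + count), never beyond. This keeps the
            # no-underestimate guarantee while inflating counters less,
            # which is the source of the error reduction the paper reports.
            est = min(self.counts[d][i] for d, i in enumerate(buckets))
            for d, i in enumerate(buckets):
                self.counts[d][i] = max(self.counts[d][i], est + count)
        else:
            for d, i in enumerate(buckets):
                self.counts[d][i] += count

    def query(self, item):
        return min(self.counts[d][i] for d, i in enumerate(self._buckets(item)))
```

For example, after five calls to `update("the dog")`, `query("the dog")` returns a value of at least 5 (exactly 5 unless other items collided in every row). Because conservative update only raises the counters it must, its counters are pointwise no larger than a plain sketch's over the same stream, so its estimates can only be tighter.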