Approximate scalable bounded space sketch for large data NLP

Authors:
Amit Goyal;Hal Daumé, III
Affiliations:
University of Maryland, College Park, MD;University of Maryland, College Park, MD
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011

Citing 28
Cited 4

Contextual correlates of synonymy

Communications of the ACM
Placing search in context: the concept revisited

ACM Transactions on Information Systems (TOIS)
New directions in traffic measurement and accounting

Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications
Measuring praise and criticism: Inference of semantic orientation from association

ACM Transactions on Information Systems (TOIS)
Word association norms, mutual information, and lexicography

ACL '89 Proceedings of the 27th annual meeting on Association for Computational Linguistics
Finding frequent items in data streams

Theoretical Computer Science - Special issue on automata, languages and programming
An improved data stream summary: the count-min sketch and its applications

Journal of Algorithms
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Non-projective dependency parsing using spanning tree algorithms

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Statistical analysis of sketch estimators

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Finding frequent items in data streams

Proceedings of the VLDB Endowment
A uniform approach to analogies, synonyms, antonyms, and associations

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
A study on similarity and relatedness using distributional and WordNet-based approaches

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Streaming for large scale NLP: language modeling

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Probabilistic counting with randomized storage

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Semi-supervised learning of dependency parsers using generalized expectation criteria

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Stream-based randomised language models for SMT

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Web-scale distributional similarity and entity set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Hash Kernels for Structured Data

The Journal of Machine Learning Research
Stream-based translation models for statistical machine translation

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Online generation of locality sensitive hash signatures

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Sketching techniques for large scale NLP

WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Sketch techniques for scaling distributional similarity to the web

GEMS '10 Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics
Using universal linguistic knowledge to guide grammar induction

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks

HotSec'10 Proceedings of the 5th USENIX conference on Hot topics in security
Covariance in Unsupervised Learning of Probabilistic Grammars

The Journal of Machine Learning Research

Fast large-scale approximate graph construction for NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Sketch algorithms for estimating point queries in NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Synergistic methods for using language in robotics

Proceedings of the Workshop on Performance Metrics for Intelligent Systems
Indexing Word Sequences for Ranked Retrieval

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply Count-Min sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).