Sketch algorithms for estimating point queries in NLP

Authors:
Amit Goyal;Hal Daumé, III;Graham Cormode
Affiliations:
University of Maryland;University of Maryland;AT&T Labs--Research
Venue:
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Year:
2012

Citing 23
Cited 2

Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
New directions in traffic measurement and accounting

Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications
Spectral bloom filters

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Measuring praise and criticism: Inference of semantic orientation from association

ACM Transactions on Information Systems (TOIS)
Word association norms, mutual information, and lexicography

ACL '89 Proceedings of the 27th annual meeting on Association for Computational Linguistics
Finding frequent items in data streams

Theoretical Computer Science - Special issue on automata, languages and programming
An improved data stream summary: the count-min sketch and its applications

Journal of Algorithms
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Statistical analysis of sketch estimators

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Finding frequent items in data streams

Proceedings of the VLDB Endowment
A uniform approach to analogies, synonyms, antonyms, and associations

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Discriminative learning of selectional preference from unlabeled text

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
A study on similarity and relatedness using distributional and WordNet-based approaches

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Streaming for large scale NLP: language modeling

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Succinct approximate counting of skewed data

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Probabilistic counting with randomized storage

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Stream-based randomised language models for SMT

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Hash Kernels for Structured Data

The Journal of Machine Learning Research
Improving the use of pseudo-words for evaluating selectional preferences

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Online generation of locality sensitive hash signatures

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Popularity is everything: a new approach to protecting passwords from statistical-guessing attacks

HotSec'10 Proceedings of the 5th USENIX conference on Hot topics in security
Approximate scalable bounded space sketch for large data NLP

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Fast large-scale approximate graph construction for NLP

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Indexing Word Sequences for Ranked Retrieval

ACM Transactions on Information Systems (TOIS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many NLP tasks rely on accurate statistics from large corpora. Tracking complete statistics is memory intensive, so recent work has proposed using compact approximate "sketches" of frequency distributions. We describe 10 sketch methods, including existing and novel variants. We compare and study the errors (over-estimation and underestimation) made by the sketches. We evaluate several sketches on three important NLP problems. Our experiments show that one sketch performs best for all the three tasks.