The space complexity of approximating the frequency moments
Journal of Computer and System Sciences
An empirical study of smoothing techniques for language modeling
ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Data streams: algorithms and applications
Foundations and Trends® in Theoretical Computer Science
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Approximate frequency counts over data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
HLT '02 Proceedings of the second international conference on Human Language Technology Research
Finding frequent items in data streams
Proceedings of the VLDB Endowment
A uniform approach to analogies, synonyms, antonyms, and associations
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Distributed language modeling for N-best list re-ranking
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
How many bits are needed to store probabilities for phrase-based translation?
StatMT '06 Proceedings of the Workshop on Statistical Machine Translation
Probabilistic counting with randomized storage
IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Stream-based randomised language models for SMT
EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Online generation of locality sensitive hash signatures
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Sketching techniques for large scale NLP
WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Sketch techniques for scaling distributional similarity to the web
GEMS '10 Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics
Smoothing techniques for adaptive online language models: topic tracking in tweet streams
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Multiple-stream language models for statistical machine translation
WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
Approximate scalable bounded space sketch for large data NLP
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Streaming analysis of discourse participants
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Fast large-scale approximate graph construction for NLP
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Sketch algorithms for estimating point queries in NLP
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Juggling the Jigsaw: towards automated problem inference from network trouble tickets
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Hi-index | 0.00 |
In this paper, we explore a streaming algorithm paradigm to handle large amounts of data for NLP problems. We present an efficient low-memory method for constructing high-order approximate n-gram frequency counts. The method is based on a deterministic streaming algorithm which efficiently computes approximate frequency counts over a stream of data while employing a small memory footprint. We show that this method easily scales to billion-word monolingual corpora using a conventional (8 GB RAM) desktop machine. Statistical machine translation experimental results corroborate that the resulting high-n approximate small language model is as effective as models obtained from other count pruning methods.