The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Ripple joins for online aggregation
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Finding Frequent Items in Data Streams
ICALP '02 Proceedings of the 29th International Colloquium on Automata, Languages and Programming
Gossip-Based Computation of Aggregate Information
FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
Tabulation based 4-universal hashing with applications to second moment estimation
SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Approximation techniques for spatial data
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
An improved data stream summary: the count-min sketch and its applications
Journal of Algorithms
Sketching streams through the net: distributed approximate query tracking
VLDB '05 Proceedings of the 31st international conference on Very large data bases
End-biased Samples for Join Cardinality Estimation
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Fast range-summable random variables for efficient aggregate estimation
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Practical algorithms for tracking database join sizes
FSTTCS '05 Proceedings of the 25th international conference on Foundations of Software Technology and Theoretical Computer Science
Finding frequent items in data streams
Proceedings of the VLDB Endowment
Methods for finding frequent items in data streams
The VLDB Journal — The International Journal on Very Large Data Bases
Sketching techniques for large scale NLP
WAC-6 '10 Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Sketch techniques for scaling distributional similarity to the web
GEMS '10 Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics
Approximate scalable bounded space sketch for large data NLP
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Fast large-scale approximate graph construction for NLP
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Sketch algorithms for estimating point queries in NLP
EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
Hi-index | 0.01 |
Sketching techniques can provide approximate answers to aggregate queries either for data-streaming or distributed computation. Small space summaries that have linearity properties are required for both types of applications. The prevalent method for analyzing sketches uses moment analysis and distribution independent bounds based on moments. This method produces clean, easy to interpret, theoretical bounds that are especially useful for deriving asymptotic results. However, the theoretical bounds obscure fine details of the behavior of various sketches and they are mostly not indicative of which type of sketches should be used in practice. Moreover, no significant empirical comparison between various sketching techniques has been published, which makes the choice even harder. In this paper, we take a close look at the sketching techniques proposed in the literature from a statistical point of view with the goal of determining properties that indicate the actual behavior and producing tighter confidence bounds. Interestingly, the statistical analysis reveals that two of the techniques, Fast-AGMS and Count-Min, provide results that are in some cases orders of magnitude better than the corresponding theoretical predictions. We conduct an extensive empirical study that compares the different sketching techniques in order to corroborate the statistical analysis with the conclusions we draw from it. The study indicates the expected performance of various sketches, which is crucial if the techniques are to be used by practitioners. The overall conclusion of the study is that Fast-AGMS sketches are, for the full spectrum of problems, either the best, or close to the best, sketching technique. This makes Fast-AGMS sketches the preferred choice irrespective of the situation.