Sketching Sampled Data Streams

  • Authors:
  • Florin Rusu;Alin Dobra

  • Affiliations:
  • -;-

  • Venue:
  • ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Sampling is used as a universal method to reduce the running time of computations -- the computation is performed on a much smaller sample and then the result is scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams and they proved to be useful for estimating frequency moments and aggregates over joins. A possibility to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream.In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the size of join and the self-join size problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for all three major types of sampling -- Bernoulli sampling which is used for load shedding, sampling with replacement which is used to generate i.i.d. samples from a distribution, and sampling without replacement which is used by online aggregation engines -- and compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data even when the sample size is only $10\%$ or less of the dataset size. This is equivalent to a speed-up factor of at least $10$ when updating the sketch.