Efficient sampling of non-strict turnstile data streams

  • Authors:
  • Neta Barkay;Ely Porat;Bar Shalem

  • Affiliations:
  • Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel;Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel;Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel

  • Venue:
  • FCT'13 Proceedings of the 19th international conference on Fundamentals of Computation Theory
  • Year:
  • 2013

Abstract

We study the problem of generating a large sample from a data stream of elements $(i,v)$, where the sample consists of pairs $(i,C_i)$ for $C_i=\sum_{(i,v)\in\text{stream}} v$. We consider strict turnstile streams and general non-strict turnstile streams, in which $C_i$ may be negative. Our sample is useful for approximating both forward and inverse distribution statistics, within an additive error $\varepsilon$ and provable success probability $1-\delta$. Our sampling method improves by an order of magnitude the known processing time of each stream element, a crucial factor in data stream applications, thereby providing a feasible solution to the problem. For example, for a sample of size $O(\varepsilon^{-2}\log(1/\delta))$ in non-strict streams, our solution requires $O((\log\log(1/\varepsilon))^2+(\log\log(1/\delta))^2)$ operations per stream element, whereas the best previous solution requires $O(\varepsilon^{-2}\log^2(1/\delta))$ evaluations of a fully independent hash function per element. We achieve this improvement by constructing an efficient $K$-elements recovery structure from which $K$ elements can be extracted with probability $1-\delta$. Our structure enables our sampling algorithm to run on distributed systems and extract statistics on the difference between streams.
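To make the quantities in the abstract concrete, the following minimal Python sketch computes the pairs (i, C_i) exactly for a toy turnstile stream, derives an inverse distribution from them, and forms the difference between two streams. It is not the paper's sampling algorithm or its recovery structure: it stores every element, which a streaming algorithm cannot afford. The function names (`aggregate`, `has_negative_totals`, `inverse_distribution`) and the toy streams are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def aggregate(stream):
    """Return {i: C_i} for every i with C_i != 0, where C_i sums v over updates (i, v)."""
    totals = defaultdict(int)
    for i, v in stream:
        totals[i] += v
    return {i: c for i, c in totals.items() if c != 0}


def has_negative_totals(totals):
    """True if some C_i is negative, which can only occur in a non-strict stream."""
    return any(c < 0 for c in totals.values())


def inverse_distribution(totals):
    """Map each value c to the fraction of nonzero elements i with C_i == c."""
    n = len(totals)
    counts = Counter(totals.values())
    return {c: k / n for c, k in counts.items()}


# Hypothetical toy streams of (i, v) updates; v may be negative (deletions).
stream_a = [(1, 3), (2, 5), (1, -1), (3, 2), (2, -5)]
stream_b = [(1, 2), (3, 4), (4, 1)]

totals_a = aggregate(stream_a)

# The difference between two streams is itself a turnstile stream:
# append stream_b with negated values. Its totals may be negative.
totals_diff = aggregate(stream_a + [(i, -v) for i, v in stream_b])

print(totals_a)
print(inverse_distribution(totals_a))
print(totals_diff, has_negative_totals(totals_diff))
```

The difference of two strict streams is in general non-strict, which is one reason the non-strict case matters for comparing streams.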