Approximately Detecting Duplicates for Probabilistic Data Streams over Sliding Windows

  • Authors:
  • Xiujun Wang;Hong Shen

  • Venue:
  • PAAP '10 Proceedings of the 2010 3rd International Symposium on Parallel Architectures, Algorithms and Programming
  • Year:
  • 2010

Abstract

A probabilistic data stream $S$ is defined as a sequence of uncertain tuples $\langle t_i, p_i \rangle, i=1 \ldots \infty$, with the semantics that element $t_i$ occurs in the stream with probability $p_i \in (0,1)$. Thus each distinct element $t$ that occurs in tuples of $S$ has an existential probability based on the tuples $\langle t, p_i \rangle \in S$. Existing duplicate detection methods for traditional deterministic data streams cannot maintain these existential probabilities for elements in $S$, which are important query information. In this paper, we present a novel data structure, the Floating Counter Bloom Filter (FCBF), an extension of CBF [1], which can maintain these existential probabilities effectively. Based on FCBF, we present an efficient algorithm to approximately detect duplicates for probabilistic data streams over sliding windows. Given a sliding window size $W$ and floating counter number $N$, for any $t$ which occurs in the past sliding window, our method outputs the accurate existential probability of $t$ with probability $1-(1/2)^{\ln(2) \cdot N/W}$. Our experimental results on synthetic data verify the effectiveness of our approach.
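
To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the FCBF itself. The function `accuracy_probability` simply evaluates the stated guarantee for a window size $W$ and floating counter number $N$. The class `ExactWindow` is a hypothetical exact baseline that stores every tuple in the window; it assumes that tuples are independent and that the existential probability of $t$ is the probability that $t$ occurs at least once, i.e. $1 - \prod (1 - p_i)$ over the tuples carrying $t$. The abstract does not spell out this definition, so treat it as an assumption. All names here are illustrative and do not come from the paper.

```python
import math
from collections import deque


def accuracy_probability(num_counters: int, window_size: int) -> float:
    """Evaluate the guarantee stated in the abstract: 1 - (1/2)^(ln(2) * N / W)."""
    return 1.0 - 0.5 ** (math.log(2) * num_counters / window_size)


class ExactWindow:
    """Exact reference baseline (NOT the FCBF): keeps the last W tuples verbatim."""

    def __init__(self, window_size: int) -> None:
        # A deque with maxlen drops the oldest tuple once the window is full.
        self.tuples = deque(maxlen=window_size)

    def add(self, element, probability: float) -> None:
        """Append one uncertain tuple <element, probability> to the window."""
        self.tuples.append((element, probability))

    def existential_probability(self, element) -> float:
        # Assumption: tuples are independent, so
        # P(element occurs at least once) = 1 - prod(1 - p_i).
        absent = 1.0
        for t, p in self.tuples:
            if t == element:
                absent *= 1.0 - p
        return 1.0 - absent
```

Under this reading of the formula, $N = W$ gives a probability of about 0.38, while $N = 10W$ gives about 0.99, so the guarantee improves quickly as more floating counters are allotted per window slot. The exact baseline needs memory proportional to $W$, which is the cost an approximate structure such as FCBF aims to avoid.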