A linear-time probabilistic counting algorithm for database applications
ACM Transactions on Database Systems (TODS)
On the security of pay-per-click and other Web advertising schemes
WWW '99 Proceedings of the eighth international conference on World Wide Web
Summary cache: a scalable wide-area web cache sharing protocol
IEEE/ACM Transactions on Networking (TON)
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Models and issues in data stream systems
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Database System Implementation
Database System Implementation
Maintaining time-decaying stream aggregates
Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient URL caching for world wide web crawling
WWW '03 Proceedings of the 12th international conference on World Wide Web
Issues in data stream management
ACM SIGMOD Record
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust Identification of Fuzzy Duplicates
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Approximate counts and quantiles over sliding windows
PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Duplicate detection in click streams
WWW '05 Proceedings of the 14th international conference on World Wide Web
DogmatiX tracks down duplicates in XML
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Time-Decaying Bloom Filters for Data Streams with Skewed Distributions
RIDE '05 Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications
Approximately detecting duplicates for streaming data using stable bloom filters
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Time-decaying sketches for sensor data aggregation
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Outlier detection over sliding windows for probabilistic data streams
Journal of Computer Science and Technology
Proceedings of the 15th International Conference on Extending Database Technology
Duplicate detection in pay-per-click streams using temporal stateful Bloom filters
International Journal of Data Analysis Techniques and Strategies
Inferential time-decaying Bloom filters
Proceedings of the 16th International Conference on Extending Database Technology
Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensitive. These make it particular significant approximately detecting duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, that effectively removes stale elements as new elements continuously arrive over sliding windows. On the DBF basis we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound of false positive rate. For a given space G bits and sliding window size W, our algorithm has an amortized time complexity of O(√G/W). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.