Approximately detecting duplicates for streaming data using stable bloom filters

Authors:
Fan Deng;Davood Rafiei
Affiliations:
University of Alberta, Edmonton, Alberta, Canada;University of Alberta, Edmonton, Alberta, Canada
Venue:
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Year:
2006

Abstract

Traditional duplicate elimination techniques are not applicable to many data stream applications. In many streaming scenarios, precisely eliminating duplicates in an unbounded data stream is not feasible. We therefore aim to eliminate duplicates approximately in streaming environments under a limited space budget. Based on a well-known bitmap sketch, we introduce a data structure, the Stable Bloom Filter (SBF), and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts stale information so that it has room for more recent elements. After deriving some properties of SBF analytically, we show that a tight upper bound on the false positive rate is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given.
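
The eviction idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an array of small counters in which each arriving element first checks its hashed cells, then a few randomly chosen cells are decremented to age out stale information, and finally the element's own cells are set to a maximum value. The class name, the parameter names (num_cells, num_hashes, num_decrements, max_value), their defaults, and the SHA-256 based hashing are all assumptions made for illustration and are not taken from the abstract.

```python
import hashlib
import random


class StableBloomFilterSketch:
    """Illustrative sketch only; names and default parameters are assumptions."""

    def __init__(self, num_cells=1024, num_hashes=3,
                 num_decrements=10, max_value=3, seed=0):
        self.num_cells = num_cells            # number of counter cells
        self.num_hashes = num_hashes          # cells probed per element
        self.num_decrements = num_decrements  # cells aged per arrival
        self.max_value = max_value            # value written on insertion
        self.cells = [0] * num_cells
        self.rng = random.Random(seed)        # seeded for reproducibility

    def _positions(self, item):
        # Derive num_hashes cell indices from salted SHA-256 digests.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_cells)
        return positions

    def seen_before(self, item):
        """Report whether item looks like a duplicate, then record it."""
        positions = self._positions(item)
        # Reported as a duplicate only if every probed cell is non-zero.
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Evict stale information: decrement randomly chosen cells.
        for _ in range(self.num_decrements):
            p = self.rng.randrange(self.num_cells)
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # Record the current element by refreshing its cells.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate
```

In this sketch, calling seen_before on each arriving element both answers the duplicate query and updates the filter. Because cells are decremented on every arrival, an element that has not reappeared for a long time may eventually be reported as new, which is the price of keeping the fraction of non-zero cells from growing without bound.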