Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Probabilistic counting algorithms for data base applications
Journal of Computer and System Sciences
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
New sampling-based summary statistics for improving approximate query answers
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Packet classification on multiple fields
Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Summary cache: a scalable wide-area web cache sharing protocol
IEEE/ACM Transactions on Networking (TON)
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
On computing correlated aggregates over continual data streams
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Scalable packet classification
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Sampling from a moving window over streaming data
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Database System Implementation
Database System Implementation
IEEE/ACM Transactions on Networking (TON)
Mercator: A scalable, extensible Web crawler
World Wide Web
Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Duplicate Detection for Symbolically Compressed Documents
ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Longest prefix matching using bloom filters
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Online duplicate document detection: signature reliability in a dynamic retrieval environment
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Farsite: federated, available, and reliable storage for an incompletely trusted environment
OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Duplicate detection in click streams
WWW '05 Proceedings of the 14th international conference on World Wide Web
DogmatiX tracks down duplicates in XML
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximately detecting duplicates for streaming data using stable bloom filters
Proceedings of the 2006 ACM SIGMOD international conference on Management of data
On biased reservoir sampling in the presence of stream evolution
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Data Streams: Models and Algorithms (Advances in Database Systems)
Data Streams: Models and Algorithms (Advances in Database Systems)
TAPER: tiered approach for eliminating redundancy in replica synchronization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Detecting hit shaving in click-through payment schemes
WOEC'98 Proceedings of the 3rd conference on USENIX Workshop on Electronic Commerce - Volume 3
Optimizing Distributed Joins with Bloom Filters
ICDCIT '08 Proceedings of the 5th International Conference on Distributed Computing and Internet Technology
Cache-, hash-, and space-efficient bloom filters
Journal of Experimental Algorithmics (JEA)
Improved approximate detection of duplicates for data streams over sliding windows
Journal of Computer Science and Technology
Real-time memory efficient data redundancy removal algorithm
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A multi-attribute data structure with parallel bloom filters for network services
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Bloofi: a hierarchical Bloom filter index with applications to distributed data provenance
Proceedings of the 2nd International Workshop on Cloud Intelligence
Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
With the explosion of information stored world-wide, data intensive computing has emerged as a central area of research. Efficient management and processing of this massively exponential amount of data from diverse sources, such as telecommunication call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc., has become a necessity. Removing redundancy from such huge (multi-billion records) datasets results in resource and compute efficiency for downstream processing and constitutes an important area of study. "Intelligent compression" or deduplication in streaming scenarios, for precise identification and elimination of duplicates from the unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) [13] address this problem to a certain extent. However, SBF suffers from a high false negative rate and slow convergence rate, thereby rendering it inefficient for applications with low false negative rate tolerance. In this paper, we present a novel reservoir sampling based Bloom filter (RSBF) technique, based on the combined concepts of reservoir sampling and Bloom filters for approximate detection of duplicates in data streams. Using detailed theoretical analysis we prove analytical bounds on its false positive rate, false negative rate and convergence rates with low memory requirements. We show that RSBF outperforms SBF in terms of false negative rates and convergence rates while consuming the same amount of memory. Using empirical analysis on real-world datasets (3 million records) and synthetic datasets with around 1 billion records, we demonstrate upto 2× improvement in false negative rate with better convergence rates as compared to SBF, while maintaining comparable false positive rates. To the best of our knowledge, this is the first attempt to integrate reservoir sampling method with Bloom filters for deduplication in streaming scenarios.