Towards "intelligent compression" in streams: a biased reservoir sampling based Bloom filter approach

Authors:
Sourav Dutta;Souvik Bhattacherjee;Ankur Narang
Affiliations:
IBM Research, New Delhi, India;IBM Research, New Delhi, India;IBM Research, New Delhi, India
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

Citing 34
Cited 2

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences
The space complexity of approximating the frequency moments

STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Packet classification on multiple fields

Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
Summary cache: a scalable wide-area web cache sharing protocol

IEEE/ACM Transactions on Networking (TON)
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
On computing correlated aggregates over continual data streams

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Scalable packet classification

Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Collection statistics for fast duplicate document detection

ACM Transactions on Information Systems (TOIS)
Sampling from a moving window over streaming data

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Database System Implementation

Database System Implementation
Compressed bloom filters

IEEE/ACM Transactions on Networking (TON)
Mercator: A scalable, extensible Web crawler

World Wide Web
Venti: A New Approach to Archival Storage

FAST '02 Proceedings of the Conference on File and Storage Technologies
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Proceedings of the 27th International Conference on Very Large Data Bases
Duplicate Detection for Symbolically Compressed Documents

ICDAR '99 Proceedings of the Fifth International Conference on Document Analysis and Recognition
Longest prefix matching using bloom filters

Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
Spectral bloom filters

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Online duplicate document detection: signature reliability in a dynamic retrieval environment

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Farsite: federated, available, and reliable storage for an incompletely trusted environment

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
Duplicate detection in click streams

WWW '05 Proceedings of the 14th international conference on World Wide Web
DogmatiX tracks down duplicates in XML

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximately detecting duplicates for streaming data using stable bloom filters

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
On biased reservoir sampling in the presence of stream evolution

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Data Streams: Models and Algorithms (Advances in Database Systems)

Data Streams: Models and Algorithms (Advances in Database Systems)
TAPER: tiered approach for eliminating redundancy in replica synchronization

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Detecting hit shaving in click-through payment schemes

WOEC'98 Proceedings of the 3rd conference on USENIX Workshop on Electronic Commerce - Volume 3
Optimizing Distributed Joins with Bloom Filters

ICDCIT '08 Proceedings of the 5th International Conference on Distributed Computing and Internet Technology
Cache-, hash-, and space-efficient bloom filters

Journal of Experimental Algorithmics (JEA)
Improved approximate detection of duplicates for data streams over sliding windows

Journal of Computer Science and Technology
Real-time memory efficient data redundancy removal algorithm

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A multi-attribute data structure with parallel bloom filters for network services

HiPC'06 Proceedings of the 13th international conference on High Performance Computing

Bloofi: a hierarchical Bloom filter index with applications to distributed data provenance

Proceedings of the 2nd International Workshop on Cloud Intelligence
Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams

Proceedings of the VLDB Endowment


Abstract

With the explosion of information stored worldwide, data-intensive computing has emerged as a central area of research. Efficient management and processing of this massive, exponentially growing amount of data from diverse sources, such as telecommunication call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), and climate warning systems, has become a necessity. Removing redundancy from such huge (multi-billion-record) datasets improves resource and compute efficiency for downstream processing and constitutes an important area of study. "Intelligent compression", or deduplication, in streaming scenarios, that is, the precise identification and elimination of duplicates from an unbounded data stream, is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) [13] address this problem to a certain extent. However, SBF suffers from a high false negative rate and a slow convergence rate, rendering it inefficient for applications with low tolerance for false negatives. In this paper, we present a novel reservoir sampling based Bloom filter (RSBF) technique, which combines reservoir sampling and Bloom filters for approximate detection of duplicates in data streams. Using detailed theoretical analysis, we prove analytical bounds on its false positive rate, false negative rate, and convergence rate under low memory requirements. We show that RSBF outperforms SBF in terms of false negative rate and convergence rate while consuming the same amount of memory. Using empirical analysis on real-world datasets (3 million records) and synthetic datasets with around 1 billion records, we demonstrate up to 2× improvement in false negative rate with better convergence rates compared to SBF, while maintaining comparable false positive rates.
To the best of our knowledge, this is the first attempt to integrate the reservoir sampling method with Bloom filters for deduplication in streaming scenarios.
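
The abstract does not spell out the RSBF algorithm, so the following is only a minimal sketch of how reservoir-style sampling might be combined with a Bloom filter for streaming duplicate detection. It should not be read as the authors' method. Everything in it is an assumption made for illustration: the class name `ReservoirBloomSketch`, the layout of `num_filters` bit arrays of `bits_per_filter` bits each, the insertion probability `s / i` for the i-th stream element (borrowed from classic reservoir sampling), the eviction of one random bit per array once the stream exceeds `s` elements, and the salted SHA-256 hashing.

```python
import hashlib
import random


class ReservoirBloomSketch:
    """Illustrative only; not the RSBF algorithm from the paper."""

    def __init__(self, num_filters=4, bits_per_filter=1024, seed=0):
        self.k = num_filters          # number of bit arrays (assumed)
        self.s = bits_per_filter      # bits per array (assumed)
        self.filters = [bytearray(self.s) for _ in range(self.k)]
        self.seen = 0                 # stream elements processed so far
        self.rng = random.Random(seed)

    def _positions(self, item):
        # One bit position per array, derived from a salted SHA-256 digest.
        positions = []
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.s)
        return positions

    def process(self, item):
        """Report whether item looks like a duplicate, then maybe insert it."""
        self.seen += 1
        positions = self._positions(item)
        is_duplicate = all(self.filters[j][p] for j, p in enumerate(positions))
        if not is_duplicate:
            # Reservoir-style rule: keep the i-th element with probability s/i.
            insert_prob = min(1.0, self.s / self.seen)
            if self.rng.random() < insert_prob:
                if self.seen > self.s:
                    # Make room by clearing one random bit in each array.
                    for j in range(self.k):
                        self.filters[j][self.rng.randrange(self.s)] = 0
                for j, p in enumerate(positions):
                    self.filters[j][p] = 1
        return is_duplicate


detector = ReservoirBloomSketch()
results = [detector.process(x) for x in ["a", "b", "a", "c", "b"]]
```

In this sketch the first `s` elements are always inserted and nothing is evicted, so repeats within that prefix are always reported as duplicates. Later in the stream, the decaying insertion probability and random bit clearing trade some false negatives for bounded memory, which is the kind of trade-off the abstract's false negative and convergence bounds are concerned with.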