The unparalleled growth and popularity of the Internet, together with the advent of diverse modern applications such as search engines, online transactions, and climate warning systems, has led to an unprecedented expansion in the volume of data stored worldwide. Efficient storage, management, and processing of such massive amounts of data has emerged as a central research theme. Detecting and removing redundancies and duplicates in real time from record sets that run into the trillions, in order to improve resource and compute efficiency, is a challenging area of study. Because it is infeasible to store all the data from potentially unbounded data streams, while applications still need duplicates to be eliminated precisely, intelligent approximate duplicate detection algorithms are required. The literature contains numerous works based on the well-known probabilistic bitmap structure, the Bloom Filter, and its variants. In this paper we propose a novel data structure, the Streaming Quotient Filter (SQF), for efficient detection and removal of duplicates in data streams. SQF stores the signatures of elements arriving on a data stream and, together with an eviction policy, provides near-zero false positive and false negative rates. We show that the near-optimal performance of SQF is achieved with a very low memory requirement, making it well suited to real-time, memory-efficient de-duplication applications with an extremely low tolerance for false positives and false negatives. We present a detailed theoretical analysis of the working of SQF, providing a guarantee on its performance. Empirically, we compare SQF to alternative methods and show that the proposed method is superior to existing solutions in terms of memory and accuracy. We also discuss Dynamic SQF for evolving streams and the parallel implementation of SQF.
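The abstract does not spell out the internals of SQF, so the following is only a minimal sketch of the general idea it names: keep short signatures of stream elements in bounded memory and evict old ones. It assumes the usual quotient-filter split of a fingerprint into a quotient (bucket index) and a remainder (stored signature). The class name `SignatureDedupSketch`, the SHA-256 hash, the bit widths, the bucket size, and the FIFO eviction rule are all hypothetical choices made for illustration, not the authors' design.

```python
import hashlib
from collections import deque


class SignatureDedupSketch:
    """Toy quotient-style duplicate detector (illustrative only; not the paper's SQF)."""

    def __init__(self, quotient_bits=8, remainder_bits=16, bucket_size=4):
        # Assumption: quotient_bits + remainder_bits <= 64 (we use 8 hash bytes).
        self.quotient_bits = quotient_bits
        self.remainder_bits = remainder_bits
        # One bounded FIFO of remainders per bucket; a full deque
        # silently drops its oldest entry on append (the eviction policy).
        self.buckets = [deque(maxlen=bucket_size) for _ in range(1 << quotient_bits)]

    def _fingerprint(self, item):
        # Hash the item and keep the top (quotient_bits + remainder_bits) bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        total_bits = self.quotient_bits + self.remainder_bits
        value = int.from_bytes(digest[:8], "big") >> (64 - total_bits)
        quotient = value >> self.remainder_bits            # selects the bucket
        remainder = value & ((1 << self.remainder_bits) - 1)  # stored signature
        return quotient, remainder

    def seen_before(self, item):
        """Report whether the item looks like a duplicate, then remember it."""
        quotient, remainder = self._fingerprint(item)
        bucket = self.buckets[quotient]
        if remainder in bucket:
            return True  # duplicate, or a rare fingerprint collision (false positive)
        bucket.append(remainder)  # may evict the oldest signature (possible false negative)
        return False


def deduplicate(stream, detector):
    """Keep only the elements the detector has not reported as duplicates."""
    return [item for item in stream if not detector.seen_before(item)]
```

In this sketch, two distinct elements with the same fingerprint produce a false positive, and an element whose signature was evicted before it reappears produces a false negative. Widening the remainder reduces the first kind of error and enlarging the buckets reduces the second, which is the memory versus accuracy trade-off the abstract refers to.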