The growth of online services has created the need for duplicate elimination in high-volume streams of events. The sheer volume of data in applications such as pay-per-click clickstream processing, RSS feed syndication, and notification services on social sites such as Twitter and Facebook makes traditional centralized solutions hard to scale. In this paper, we propose an approach based on distributed filtering. To this end, we introduce a suite of distributed Bloom filters that exploit different ways of partitioning the event space. To address the continuous nature of event delivery, the filters are extended to support sliding-window semantics. Moreover, we examine locality-related tradeoffs and propose a tree-based architecture that allows for duplicate elimination across geographic locations. We lay out the design space and present experimental results that demonstrate the pros and cons of our various solutions in different settings.
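To make the combination of event-space partitioning and sliding-window semantics concrete, the following is a minimal sketch and not the design evaluated in the paper. It assumes hash-based partitioning of event keys across nodes, and it approximates a count-based sliding window by rotating a small number of Bloom filter generations. All class names, parameters, and defaults (`BloomFilter`, `WindowedFilter`, `PartitionedDeduplicator`, `num_bits`, `generations`) are hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: `num_hashes` probes into a `num_bits`-bit array."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Double hashing: derive every probe position from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class WindowedFilter:
    """Approximate sliding window (assumption: rotating filter generations)."""

    def __init__(self, window_size, generations=2):
        self.slice_size = max(1, window_size // generations)
        self.filters = [BloomFilter() for _ in range(generations)]
        self.count = 0

    def seen_before(self, key):
        duplicate = any(key in f for f in self.filters)
        if not duplicate:
            self.filters[0].add(key)
        self.count += 1
        if self.count >= self.slice_size:
            # Drop the oldest generation and start a fresh one at the front.
            self.filters = [BloomFilter()] + self.filters[:-1]
            self.count = 0
        return duplicate


class PartitionedDeduplicator:
    """Routes each key to one node, so each node filters a disjoint key range."""

    def __init__(self, num_nodes, window_size):
        self.nodes = [WindowedFilter(window_size) for _ in range(num_nodes)]

    def _owner(self, key):
        # Assumption: partition the event space by a hash of the event key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[16:20], "big") % len(self.nodes)

    def is_duplicate(self, key):
        return self.nodes[self._owner(key)].seen_before(key)
```

In this sketch every key has exactly one owning node, so a duplicate check touches a single filter and no cross-node coordination is needed. The rotation scheme forgets events only at generation granularity, so an event is remembered for somewhere between one and `generations` slices of events seen by its node. As with any Bloom filter, a new event can occasionally be misreported as a duplicate.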