Cardinality estimation and dynamic length adaptation for Bloom filters

Authors:
Odysseas Papapetrou; Wolf Siberski; Wolfgang Nejdl
Affiliations:
L3S Research Center, Leibniz Universität Hannover, Hannover, Germany (all authors)
Venue:
Distributed and Parallel Databases
Year:
2010

Abstract

Bloom filters are extensively used in distributed applications, especially in distributed databases and distributed information systems, to reduce network requirements and to increase performance. In this work, we propose two novel Bloom filter features that are important for distributed databases and information systems. First, we present a new approach to encoding a Bloom filter such that its length can be adapted to the cardinality of the set it represents, with negligible overhead in computation and false positive probability. The proposed encoding allows for significant network savings in distributed databases, as it enables the participating nodes to optimize the length of each Bloom filter before sending it over the network, for example, when executing Bloom joins. Second, we show how to estimate the number of distinct elements in a Bloom filter for situations where the represented set is not materialized. These situations frequently arise in distributed databases, where estimating the cardinality of the represented sets is necessary for constructing an efficient query plan. The estimation is highly accurate and comes with tight probabilistic bounds. For both features, we provide a thorough probabilistic analysis and an extensive experimental evaluation, which confirm the effectiveness of our approaches.
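
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the encoding or the estimator from the paper, whose details are not given here. It assumes a toy `BloomFilter` class (hypothetical, with salted SHA-256 hashes reduced modulo the length `m`), the standard estimator obtained by inverting the expected number of set bits, and a simple "folding" step that halves the filter by OR-ing its two halves. Folding works here only because every bit position is computed as a hash value modulo `m`, so reducing modulo `m/2` is consistent with the original positions.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: k hash values come from SHA-256 with a per-function salt,
        # each reduced modulo the filter length m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def estimate_cardinality(self):
        # Invert E[set bits] = m * (1 - (1 - 1/m)^(k*n)) to solve for n.
        t = sum(self.bits)
        if t == self.m:
            return math.inf  # saturated filter: no finite estimate
        return math.log(1 - t / self.m) / (self.k * math.log(1 - 1 / self.m))

    def fold(self):
        """Return a filter of half the length by OR-ing the two halves."""
        if self.m % 2:
            raise ValueError("length must be even to fold")
        half = self.m // 2
        folded = BloomFilter(half, self.k)
        folded.bits = [a | b for a, b in zip(self.bits[:half], self.bits[half:])]
        return folded


if __name__ == "__main__":
    bf = BloomFilter(4096, 4)
    for i in range(500):
        bf.add(f"item-{i}")
    print("estimated distinct elements:", bf.estimate_cardinality())
    small = bf.fold()
    print("folded length:", small.m, "still contains item-7:", "item-7" in small)
```

In this sketch the folded filter is bit-for-bit identical to a filter of length `m/2` built directly with the same hash functions, so folding introduces no false negatives. The false positive rate rises as the filter shrinks, which is the trade-off a node would weigh before shipping a filter over the network.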