COCA filters: co-occurrence aware Bloom filters

Authors:
Kamran Tirdad;Pedram Ghodsnia;J. Ian Munro;Alejandro López-Ortiz
Affiliations:
Cheriton School of Computer Science, University of Waterloo;Cheriton School of Computer Science, University of Waterloo;Cheriton School of Computer Science, University of Waterloo;Cheriton School of Computer Science, University of Waterloo
Venue:
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Year:
2011

Abstract

We propose an indexing data structure based on a novel variation of Bloom filters. Signature files have been proposed in the past as a method to index large text databases, though they suffer from a high false positive error. In this paper we introduce COCA filters, a new type of Bloom filter that exploits the co-occurrence probability of words in documents to reduce the false positive error. We show experimentally that this technique can reduce the false positive error by a factor of up to 21.6 for the same index size. Furthermore, COCA filters can replace Bloom filters wherever the co-occurrence of any two members of the universe is identifiable.
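
The abstract does not spell out how co-occurrence is exploited, so the following is only a minimal sketch of the baseline it compares against: a plain Bloom filter used as a per-document signature, with its false positive rate measured empirically. The class name, the `false_positive_rate` helper, the SHA-256 double-hashing scheme, and the parameters `m` and `k` are all assumptions made for illustration. None of this is the COCA construction itself.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter with m bits and k hash positions per item.

    Baseline structure only; this is NOT the COCA variant.
    """

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        # Assumed hashing scheme: derive k positions from one SHA-256
        # digest using double hashing, h1 + i * h2 (mod m).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def false_positive_rate(bf: BloomFilter, absent_words) -> float:
    """Fraction of words known to be absent that the filter still reports."""
    absent_words = list(absent_words)
    hits = sum(1 for w in absent_words if w in bf)
    return hits / len(absent_words)


# Hypothetical vocabulary of one document, and words it does not contain.
doc_words = [f"word{i}" for i in range(1000)]
absent = [f"other{i}" for i in range(10000)]

signature = BloomFilter(m=8192, k=4)
for w in doc_words:
    signature.add(w)

rate = false_positive_rate(signature, absent)
print(f"empirical false positive rate: {rate:.4f}")
```

For these assumed parameters the standard approximation (1 - e^(-kn/m))^k gives roughly 0.02, so about one absent word in fifty would be reported as present. This is the error that the abstract says COCA filters reduce at the same index size.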