We propose an indexing data structure based on a novel variation of Bloom filters. Signature files have been proposed in the past as a method for indexing large text databases, but they suffer from a high false-positive rate. In this paper we introduce COCA filters, a new type of Bloom filter that exploits the co-occurrence probability of words in documents to reduce the false-positive rate. We show experimentally that this technique can reduce the false-positive rate by a factor of up to 21.6 for the same index size. Furthermore, COCA filters can replace Bloom filters wherever the co-occurrence of any two members of the universe is identifiable.
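For orientation, the baseline that the abstract compares against can be sketched as a standard Bloom filter used as a per-document signature. This is a minimal illustration only, not the COCA construction, whose details are not given here. The names `BloomFilter`, `build_signature`, and `false_positive_rate`, the default sizes, and the SHA-256 double-hashing scheme are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter: a bit array probed by several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Assumed scheme: derive k probe positions from one SHA-256 digest
        # via double hashing (h1 + i * h2) mod m.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, word):
        # True if every probed bit is set; this may be a false positive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(word))


def build_signature(words, num_bits=256, num_hashes=3):
    """Index one document's words into a single filter (its signature)."""
    signature = BloomFilter(num_bits, num_hashes)
    for word in words:
        signature.add(word)
    return signature


def false_positive_rate(signature, absent_words):
    """Fraction of words known to be absent that the filter still reports."""
    absent_words = list(absent_words)
    hits = sum(1 for word in absent_words if word in signature)
    return hits / len(absent_words)


if __name__ == "__main__":
    document = ["bloom", "filter", "signature", "index", "document"]
    sig = build_signature(document)
    probes = ["absent-%d" % i for i in range(10000)]
    print("measured false-positive rate:", false_positive_rate(sig, probes))
```

A filter of this kind never reports a stored word as missing, so the only error is the false-positive rate measured by the last function. That is the quantity the abstract reports reducing for a fixed index size.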