The signature file is a well-studied method in information retrieval for indexing large text databases. Because its index is small, it is a good candidate for environments where memory is scarce. This small index size, however, comes at the cost of a high false positive rate. In this paper we address that problem by introducing COCA filters, a new variation of Bloom filters that exploits the co-occurrence probability of words in documents to reduce false positives. We show experimentally that, on real document collections, this technique reduces the false positive rate by a factor of up to 21 for the same index size. We also show that in some extreme cases it eliminates false positives entirely. COCA filters can be considered a good replacement for Bloom filters wherever the co-occurrence of any two members of the universe can be identified.
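To make the trade-off between index size and false positives concrete, the following is a minimal sketch of a plain Bloom filter used as a per-document signature. It is not the COCA filter, whose construction the abstract does not describe; the class name `BloomSignature`, the helpers `build_signature` and `false_positive_rate`, the salted SHA-256 hashing, and the parameter values are all assumptions chosen for illustration.

```python
import hashlib


class BloomSignature:
    """Plain Bloom filter used as a document signature (illustrative, not COCA)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored in a single Python int

    def _positions(self, word: str):
        # Assumed hashing scheme: salt SHA-256 with the hash index
        # to derive num_hashes bit positions per word.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def might_contain(self, word: str) -> bool:
        # True for every added word; may also be True for absent words.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


def build_signature(words, num_bits: int, num_hashes: int) -> BloomSignature:
    sig = BloomSignature(num_bits, num_hashes)
    for word in words:
        sig.add(word)
    return sig


def false_positive_rate(sig: BloomSignature, absent_words) -> float:
    # Fraction of words known to be absent that the signature still reports.
    absent_words = list(absent_words)
    hits = sum(sig.might_contain(w) for w in absent_words)
    return hits / len(absent_words)


# Hypothetical document of 50 words, queried with 1000 words it does not contain.
document = [f"word{i}" for i in range(50)]
queries = [f"other{i}" for i in range(1000)]

small = build_signature(document, num_bits=64, num_hashes=3)
large = build_signature(document, num_bits=4096, num_hashes=3)

print("64-bit signature false positive rate:", false_positive_rate(small, queries))
print("4096-bit signature false positive rate:", false_positive_rate(large, queries))
```

The smaller signature saturates quickly and reports most absent words as present, which is the cost the abstract refers to. A plain Bloom filter like this one treats every word independently; the abstract's claim is that using word co-occurrence information lowers the false positive rate without enlarging the index.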