A novel approach for leveraging co-occurrence to improve the false positive error in signature files

  • Authors:
  • Pedram Ghodsnia; Kamran Tirdad; J. Ian Munro; Alejandro López-Ortiz

  • Affiliations:
  • Cheriton School of Computer Science, University of Waterloo, Canada;Cheriton School of Computer Science, University of Waterloo, Canada;Cheriton School of Computer Science, University of Waterloo, Canada;Cheriton School of Computer Science, University of Waterloo, Canada

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2013

Abstract

The signature file is a well-studied method in information retrieval for indexing large text databases. Because its index is small, it is a good candidate for environments where memory is scarce. This small index size, however, comes at the cost of a high false positive rate. In this paper we address that problem by introducing COCA filters, a new variation of Bloom filters that exploits the co-occurrence probability of words in documents to reduce false positives. We show experimentally that, on real document collections, this technique can reduce the false positive error by a factor of up to 21 for the same index size. We also show that in some extreme cases it can eliminate false positives completely. COCA filters can be considered a good replacement for Bloom filters wherever the co-occurrence of any two members of the universe is identifiable.
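
For readers unfamiliar with the baseline, the following is a minimal sketch of a plain Bloom filter used as a per-document signature, together with the standard approximation of its false positive rate. It is not the COCA filter, whose construction the abstract does not describe. The class name `BloomSignature`, the function `expected_false_positive_rate`, and the choice of salted SHA-256 as the hash family are assumptions made for illustration only.

```python
import hashlib
import math


class BloomSignature:
    """Fixed-width bit signature for the word set of one document (illustrative)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, word):
        # Assumption: derive num_hashes positions by salting SHA-256 with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def might_contain(self, word):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


def expected_false_positive_rate(num_bits, num_hashes, num_words):
    """Standard Bloom filter approximation (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_words / num_bits)) ** num_hashes
```

A query word that was added to the signature is always reported as possibly present, while a word that was never added may still be reported when its bit positions happen to be set by other words. Shrinking `num_bits` for the same number of words raises the expected rate, which is the trade-off between index size and false positives that the abstract refers to.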