Within-document term-based index pruning with statistical hypothesis testing

  • Authors:
  • Sree Lekha Thota;Ben Carterette

  • Affiliations:
  • Department of Computer and Information Sciences, University of Delaware, Newark, DE;Department of Computer and Information Sciences, University of Delaware, Newark, DE

  • Venue:
  • ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
  • Year:
  • 2011

Abstract

Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can be used to significantly decrease both the size of the index and query time, with less loss of retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
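
The pruning decision described in the abstract can be illustrated with a standard pooled two-proportion z-test. The following is a minimal sketch, not the authors' implementation: the function names, the one-sided direction of the test, the default threshold of 1.645, and the toy counts are all assumptions made for illustration. The abstract does not specify these details.

```python
import math


def two_proportion_z(tf, doc_len, cf, coll_len):
    """Pooled two-proportion z statistic comparing a term's relative
    frequency in one document (tf / doc_len) with its relative
    frequency in the collection (cf / coll_len)."""
    p_doc = tf / doc_len
    p_coll = cf / coll_len
    pooled = (tf + cf) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / doc_len + 1.0 / coll_len))
    if se == 0.0:
        # Degenerate case (term absent or universal): no evidence either way.
        return 0.0
    return (p_doc - p_coll) / se


def prune_document(term_freqs, coll_freqs, coll_len, threshold=1.645):
    """Keep only postings whose within-document frequency is significantly
    higher than the collection frequency (assumed one-sided test).
    `threshold` is a hypothetical tuning parameter: a larger value prunes more."""
    doc_len = sum(term_freqs.values())
    return {
        term: tf
        for term, tf in term_freqs.items()
        if two_proportion_z(tf, doc_len, coll_freqs[term], coll_len) > threshold
    }


# Toy example with made-up counts.
doc = {"the": 2, "pruning": 8}
collection = {"the": 200_000, "pruning": 100}
kept = prune_document(doc, collection, coll_len=1_000_000)
print(kept)
```

In this toy example, "the" occurs in the document at the same rate as in the collection, so its z statistic is zero and its posting is dropped, while "pruning" is heavily over-represented in the document and is kept. Raising the threshold trades index size against retrieval effectiveness.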