Within-document term-based index pruning with statistical hypothesis testing

Authors:
Sree Lekha Thota;Ben Carterette
Affiliations:
Department of Computer and Information Sciences, University of Delaware, Newark, DE;Department of Computer and Information Sciences, University of Delaware, Newark, DE
Venue:
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Year:
2011



Abstract

Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can significantly decrease index size and query time with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
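
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes the standard pooled two-proportion z statistic and a one-sided decision rule that keeps a term only when it occurs significantly more often in the document than in the collection. The function names, the threshold of 1.645, and the keep/prune rule are illustrative assumptions, not details taken from the paper.

```python
import math


def two_proportion_z(tf, doc_len, ctf, coll_len):
    """Pooled two-proportion z statistic (textbook form, assumed here).

    tf / doc_len   : term rate within the document
    ctf / coll_len : term rate within the whole collection
    """
    p_doc = tf / doc_len
    p_coll = ctf / coll_len
    # Pooled estimate of the common proportion under the null hypothesis.
    pooled = (tf + ctf) / (doc_len + coll_len)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / doc_len + 1.0 / coll_len))
    if se == 0.0:
        return 0.0  # degenerate case: no variance, treat as no evidence
    return (p_doc - p_coll) / se


def prune_document(term_freqs, collection_freqs, coll_len, z_threshold=1.645):
    """Return only the postings whose z statistic exceeds the threshold.

    z_threshold is a hypothetical tuning knob: larger values prune more.
    """
    doc_len = sum(term_freqs.values())
    return {
        term: tf
        for term, tf in term_freqs.items()
        if two_proportion_z(tf, doc_len, collection_freqs[term], coll_len) > z_threshold
    }


# Toy data (made up for illustration): "the" is as common in the document
# as in the collection, while "pruning" is far more common in the document.
doc = {"the": 60, "pruning": 40}
collection = {"the": 600_000, "pruning": 100}
kept = prune_document(doc, collection, coll_len=1_000_000)
print(kept)
```

In this sketch, raising `z_threshold` trades a smaller index for a greater risk of dropping terms that matter for retrieval, which mirrors the size-versus-effectiveness trade-off the abstract describes.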