Probabilistic static pruning of inverted files

Authors:
Roi Blanco;Alvaro Barreiro
Affiliations:
University of A Coruña, Spain;University of A Coruña, Spain
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2010

Citing 35
Cited 5

Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic condensation of electronic publications by sentence selection

Information Processing and Management: an International Journal - Special issue: summarizing text
Query evaluation: strategies and optimizations

Information Processing and Management: an International Journal
Pivoted document length normalization

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
A vector space model for automatic indexing

Communications of the ACM
Static index pruning for information retrieval systems

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Relevance based language models

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Generic summaries for indexing in information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Modeling score distributions for combining the outputs of search engines

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval

Information Retrieval
Modern Information Retrieval

Modern Information Retrieval
Impact transformation: effective and efficient web retrieval

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
The Importance of Prior Probabilities for Entry Page Search

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Document normalization revisited

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Probabilistic models of information retrieval based on measuring the divergence from randomness

ACM Transactions on Information Systems (TOIS)
Probabilistic models of indexing and searching

SIGIR '80 Proceedings of the 3rd annual ACM conference on Research and development in information retrieval
Query-independent evidence in home page finding

ACM Transactions on Information Systems (TOIS)
Comparing Top k Lists

SIAM Journal on Discrete Mathematics
A study of parameter tuning for term frequency normalization

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems (TOIS)
Information retrieval system evaluation: effort, sensitivity, and reliability

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Optimization strategies for complex queries

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)

TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)
Inverted files for text search engines

ACM Computing Surveys (CSUR)
Pruned query evaluation using pre-computed impacts

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A document-centric approach to static index pruning in text retrieval systems

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
The impact of caching on search engines

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Pruning policies for two-tiered inverted index with correctness guarantee

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Boosting static pruning of inverted files

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Optimized query execution in large search engines with global page ordering

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Top-k query evaluation with probabilistic guarantees

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Static pruning of terms in inverted files

ECIR'07 Proceedings of the 29th European conference on IR research
Frequentist and bayesian approach to information retrieval

ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval

Faster top-k document retrieval using block-max indexes

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Information preservation in static index pruning

Proceedings of the 21st ACM international conference on Information and knowledge management
Optimizing top-k document retrieval strategies for block-max indexes

Proceedings of the sixth ACM international conference on Web search and data mining
An information-theoretic account of static index pruning

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Efficient parallel block-max WAND algorithm

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Information retrieval (IR) systems typically compress their indexes in order to increase their efficiency. Static pruning is a form of lossy data compression: it removes from the index, data that is estimated to be the least important to retrieval performance, according to some criterion. Generally, pruning criteria are derived from term weighting functions, which assign weights to terms according to their contribution to a document's contents. Usually, document-term occurrences that are assigned a low weight are ruled out from the index. The main assumption is that those entries contribute little to the document content. We present a novel pruning technique that is based on a probabilistic model of IR. We employ the Probability Ranking Principle as a decision criterion over which posting list entries are to be pruned. The proposed approach requires the estimation of three probabilities, combining them in such a way that we gather all the necessary information to apply the aforementioned criterion. We evaluate our proposed pruning technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is proposing a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.