Information retrieval (IR) systems typically compress their indexes to increase efficiency. Static pruning is a form of lossy compression: it removes from the index the data estimated to be least important to retrieval performance, according to some criterion. Pruning criteria are generally derived from term weighting functions, which assign weights to terms according to their contribution to a document's content. Usually, document-term occurrences that receive a low weight are removed from the index, on the assumption that those entries contribute little to the document's content. We present a novel pruning technique based on a probabilistic model of IR. We employ the Probability Ranking Principle as the decision criterion for which posting list entries are pruned. The approach requires estimating three probabilities and combining them so that they supply all the information needed to apply this criterion. We evaluate the technique on five TREC collections and various retrieval tasks, and show that in almost every situation it outperforms the state of the art in index pruning. The main contribution of this work is a pruning technique that stems directly from the same source as probabilistic retrieval models, and hence is independent of the final model used for retrieval.
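
To make the general idea of weight-based static pruning concrete, the following is a minimal sketch of the conventional approach described above, in which postings with a low term weight are dropped. It is not the probabilistic criterion proposed in this work. The index layout, the function names `tf_idf_weight` and `prune_index`, the tf-idf style weight, and the fixed threshold are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical in-memory index: term -> list of (doc_id, term_frequency).
Posting = Tuple[int, int]
Index = Dict[str, List[Posting]]


def tf_idf_weight(tf: int, df: int, num_docs: int) -> float:
    """Illustrative stand-in weight: tf * log(N / df).

    Any term weighting function could be substituted here; this one is
    an assumption made for the example, not the criterion of the paper.
    """
    return tf * math.log(num_docs / df)


def prune_index(index: Index, num_docs: int, threshold: float) -> Index:
    """Return a copy of the index without postings weighted below threshold."""
    pruned: Index = {}
    for term, postings in index.items():
        df = len(postings)  # document frequency of the term
        kept = [
            (doc_id, tf)
            for doc_id, tf in postings
            if tf_idf_weight(tf, df, num_docs) >= threshold
        ]
        if kept:  # drop terms whose posting list becomes empty
            pruned[term] = kept
    return pruned


if __name__ == "__main__":
    toy_index: Index = {
        "the": [(1, 3), (2, 2), (3, 4), (4, 1)],  # occurs in every document
        "pruning": [(1, 5), (3, 1)],
    }
    print(prune_index(toy_index, num_docs=4, threshold=1.0))
```

In this sketch the decision is made once, at indexing time, and independently for each posting. A different pruning criterion would replace only the test inside the list comprehension, leaving the traversal of the index unchanged.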