Entropy-Based Static Index Pruning

Authors:
Lei Zheng;Ingemar J. Cox
Affiliations:
University College London, Suffolk, United Kingdom IP5 3RE;University College London, Suffolk, United Kingdom IP5 3RE
Venue:
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval
Year:
2009

Citing 7

Viewing morphology as an inference process
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval

A probabilistic model of information retrieval: development and comparative experiments
Information Processing and Management: an International Journal

A stop list for general text
ACM SIGIR Forum

Static index pruning for information retrieval systems
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

A document-centric approach to static index pruning in text retrieval systems
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management

Introduction to Information Retrieval
Introduction to Information Retrieval

Static pruning of terms in inverted files
ECIR'07 Proceedings of the 29th European conference on IR research

Cited 3

Term frequency quantization for compressing an inverted index
AMT'10 Proceedings of the 6th international conference on Active media technology

Information preservation in static index pruning
Proceedings of the 21st ACM international conference on Information and knowledge management

An information-theoretic account of static index pruning
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Abstract

We propose a new entropy-based algorithm for static index pruning. The algorithm computes an importance score for each document in the collection based on the entropy of each term. A threshold is set according to the desired level of pruning, and all postings associated with documents that score below this threshold are removed from the index, i.e. the documents are removed from the collection. We compare this entropy-based approach with previous work by Carmel et al. [1] on both the Financial Times (FT) and Los Angeles Times (LA) collections. Experimental results reveal that the entropy-based approach has superior performance on the FT collection for both precision at 10 (P@10) and mean average precision (MAP). However, for the LA collection, Carmel's method is generally superior in terms of MAP. The variation in performance across collections suggests that a hybrid algorithm that incorporates elements of both methods might have more stable performance across collections. A simple hybrid method is tested, in which a first 10% pruning is performed using the entropy-based method and further pruning is performed by Carmel's method. Experimental results show that the hybrid algorithm slightly improves on Carmel's method, but performs significantly worse than the entropy-based method on the FT collection.
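
The document-level pruning described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so everything specific here is an assumption made for illustration: the entropy of a term is taken over the distribution of its occurrences across documents, and a document's score is the sum of tf * (log2(N) - H(term)) over its terms. The names `term_entropy`, `document_scores`, and `prune_index`, and the `prune_fraction` parameter, are hypothetical and not taken from the paper.

```python
import math
from collections import defaultdict


def term_entropy(postings):
    """Shannon entropy (in bits) of a term's occurrences across documents.

    `postings` maps doc_id -> term frequency for a single term.
    """
    total = sum(postings.values())
    return -sum((tf / total) * math.log2(tf / total) for tf in postings.values())


def document_scores(index, num_docs):
    """Assumed scoring rule (not from the paper): for each document, sum
    tf * (log2(N) - H(term)) over its terms, so that terms concentrated
    in few documents contribute more than evenly spread terms."""
    max_entropy = math.log2(num_docs) if num_docs > 1 else 0.0
    scores = defaultdict(float)
    for postings in index.values():
        weight = max_entropy - term_entropy(postings)
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * weight
    return dict(scores)


def prune_index(index, num_docs, prune_fraction):
    """Remove the lowest-scoring fraction of documents and their postings.

    Returns the pruned index and the set of removed document ids.
    """
    scores = document_scores(index, num_docs)
    # Sort by score, breaking ties by doc_id so the result is deterministic.
    ranked = sorted(scores, key=lambda d: (scores[d], d))
    removed = set(ranked[: int(prune_fraction * len(ranked))])
    pruned = {}
    for term, postings in index.items():
        kept = {d: tf for d, tf in postings.items() if d not in removed}
        if kept:  # drop terms whose posting list became empty
            pruned[term] = kept
    return pruned, removed
```

Under these assumptions, the pruning level is set as a fraction of documents rather than as an explicit score threshold; the two are equivalent once the documents are ranked. The hybrid method mentioned in the abstract would apply a step like this at a 10% pruning level before handing the remaining index to Carmel's method, which is not sketched here.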