In this paper, we investigate the lossy compression of term frequencies in an inverted index based on quantization. First, we examine the number of bits needed to encode term frequencies with little or no degradation of retrieval performance; both term-independent and term-specific quantizers are investigated. Next, we describe an iterative technique for learning quantization step sizes. Experiments on standard TREC test sets demonstrate that retrieval performance is nearly unaffected when only 2 or 3 bits are allocated to the quantized term frequencies. This is comparable to lossless coding techniques such as unary, γ-, and δ-codes. However, if lossless coding is then applied to the quantized term frequency values, savings of around 26% (or 12%) can be achieved over lossless coding alone, with less than 2.5% (or no measurable) degradation in retrieval performance.
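To make the pipeline concrete, the following is a minimal sketch of the two stages the abstract combines: a few-bit quantizer for term frequencies, followed by a lossless code applied to the quantized values. The logarithmic bucketing used here is an illustrative assumption, not the paper's learned step sizes; the function names are hypothetical.

```python
import math

def quantize_tf(tf, bits=3):
    """Map a raw term frequency to one of 2**bits levels.

    Illustrative term-independent quantizer: log-scale bucketing gives
    fine resolution to the small tf values that dominate real indexes.
    (The paper instead learns quantization step sizes iteratively.)
    """
    if tf <= 0:
        return 0
    levels = 2 ** bits
    return min(levels - 1, int(math.log2(tf) + 0.5))

def dequantize_tf(q):
    """Reconstruct an approximate term frequency from its quantized code."""
    return 2 ** q

def elias_gamma(n):
    """Elias gamma-code for a positive integer n, as a bit string:
    (b-1) zero bits, then the b-bit binary form of n, where b = |bin(n)|."""
    assert n >= 1
    b = n.bit_length()
    return '0' * (b - 1) + bin(n)[2:]

# Lossless coding applied to quantized values (codes must be >= 1,
# so we gamma-code q + 1): short codes for the common low buckets.
for tf in (1, 2, 3, 10, 100):
    q = quantize_tf(tf)
    print(tf, q, elias_gamma(q + 1))
```

Because the quantized values occupy a much smaller range than raw term frequencies, the gamma codes for `q + 1` stay short, which is the source of the additional savings the abstract reports when lossless coding is stacked on top of quantization.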