Term frequency quantization for compressing an inverted index

Authors:
Lei Zheng;Ingemar J. Cox
Affiliations:
Department of Computer Science, University College London, London, United Kingdom;Department of Computer Science, University College London, London, United Kingdom
Venue:
AMT'10 Proceedings of the 6th international conference on Active media technology
Year:
2010


Abstract

In this paper, we investigate lossy compression of term frequencies in an inverted index based on quantization. First, we examine how many bits are needed to code term frequencies with little or no degradation of retrieval performance. Both term-independent and term-specific quantizers are investigated. Next, an iterative technique is described for learning quantization step sizes. Experiments on standard TREC test sets demonstrate that retrieval performance is almost unaffected when only 2 or 3 bits are allocated to the quantized term frequencies. This bit budget is comparable to that of lossless coding techniques such as unary, γ and δ-codes. However, if lossless coding is applied to the quantized term frequency values, savings of around 26% (or 12%) can be achieved over lossless coding alone, with less than 2.5% (or, respectively, no measurable) degradation in retrieval performance.
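To make the idea concrete, the following is a minimal sketch of fixed-width term frequency quantization. It is not the authors' quantizer. It assumes a uniform quantizer with a hand-picked step size, whereas the paper learns step sizes iteratively. The function names `quantize_tf`, `dequantize_tf` and `gamma_code_length`, and the sample frequencies, are hypothetical and chosen only for illustration. The sketch also compares the bit cost of γ-coding the raw and the quantized values.

```python
def quantize_tf(tf, step, bits):
    """Map a term frequency (>= 1) to a level in [0, 2**bits - 1].

    Assumption: uniform bins of width `step`; values beyond the last
    bin are clamped to the highest level.
    """
    if tf < 1:
        raise ValueError("term frequency must be at least 1")
    level = (tf - 1) // step
    return min(level, (1 << bits) - 1)


def dequantize_tf(level, step):
    """Reconstruct a representative term frequency (lower edge of the bin)."""
    return level * step + 1


def gamma_code_length(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers")
    return 2 * (n.bit_length() - 1) + 1


# Hypothetical term frequencies from one postings list.
tfs = [1, 1, 2, 3, 7, 15, 40]
bits, step = 2, 2

levels = [quantize_tf(tf, step, bits) for tf in tfs]

# Cost of lossless gamma coding applied to the raw frequencies.
raw_bits = sum(gamma_code_length(tf) for tf in tfs)
# Cost of storing each quantized level in a fixed-width field.
fixed_bits = bits * len(tfs)
# Cost of gamma coding the quantized levels (shifted by 1, since
# gamma codes cannot represent 0).
quantized_gamma_bits = sum(gamma_code_length(lv + 1) for lv in levels)

print(levels, raw_bits, fixed_bits, quantized_gamma_bits)
```

On this toy list the quantized values cost fewer bits than the raw ones under either coding, but the numbers depend entirely on the assumed step size and sample data and say nothing about the savings reported in the paper.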