Compressing Inverted Files

Authors:
Andrew Trotman
Affiliations:
Department of Computer Science, University of Otago, PO Box 56, Dunedin, New Zealand
Venue:
Information Retrieval
Year:
2003

Abstract

Research into inverted file compression has focused on compression ratio: how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search.

The premise "smaller is better" may not be true. To truly build faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences, compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then both the compressed and raw postings would require one disk seek and one disk sector read. A less efficient compression technique may increase the file size but decrease load/decompress time, thereby increasing throughput.

Examined here are five compression techniques: Golomb, Elias gamma, Elias delta, Variable Byte Encoding, and Binary Interpolative Coding. The effects on file size, file seek time, and file read time are all measured, as is decompression time. A quantitative measure of throughput is developed and the performance of each method is determined.
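The 128-posting example in the abstract can be worked through numerically. The following is a minimal Python sketch, not the paper's implementation. The sector and word sizes come from the abstract. Everything else is an assumption made for illustration: the helper names (`sectors_needed`, `vbyte_encode`, `vbyte_decode`), the sample gap values, and the byte layout of the variable byte code (7 data bits per byte, high bit set on the last byte of each integer, which is one common convention among several).

```python
import math

SECTOR_BYTES = 512  # minimum disk sector read size, as stated in the abstract
WORD_BYTES = 4      # machine word size, as stated in the abstract


def sectors_needed(size_bytes, sector_bytes=SECTOR_BYTES):
    """Whole sectors that must be read to fetch size_bytes (hypothetical helper)."""
    return max(1, math.ceil(size_bytes / sector_bytes))


def vbyte_encode(gaps):
    """Variable-byte encode non-negative integers.

    Assumed layout: 7 data bits per byte, most significant group first,
    high bit set only on the final byte of each integer.
    """
    out = bytearray()
    for n in gaps:
        groups = [n & 0x7F]          # least significant 7 bits
        n >>= 7
        while n:
            groups.append(n & 0x7F)
            n >>= 7
        groups[0] |= 0x80            # flag the least significant group as the terminator
        out.extend(reversed(groups)) # emit most significant group first
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode under the same assumed layout."""
    gaps, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                 # terminator byte: the integer is complete
            gaps.append(n)
            n = 0
    return gaps


# Hypothetical inverted list: 128 document gaps, each small enough for one byte.
gaps = [(i % 100) + 1 for i in range(128)]

raw_bytes = len(gaps) * WORD_BYTES   # 128 words of 4 bytes = 512 bytes
packed = vbyte_encode(gaps)          # 128 bytes for these sample gaps

print(sectors_needed(raw_bytes), sectors_needed(len(packed)))
```

Under these assumptions the raw list fills exactly one 512-byte sector and the compressed list fits in a quarter of one, so both cost one seek and one sector read. For a list this short the compressed form saves no I/O and still has to be decoded, which is the overhead the abstract describes.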