Compression of indexes with full positional information in very large text databases

Authors:
Gordon Linoff; Craig Stanfill
Venue:
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
1993

Abstract

This paper describes a combination of compression methods that can be used to reduce the size of inverted indexes for very large text databases. These methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. When these methods are applied to two different text sources (the King James Version of the Bible and a sample of Wall Street Journal stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The decreased time required for I/O can almost fully compensate for the time needed to uncompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.
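
The abstract names Prefix Omission without describing it, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each posting is a sorted tuple of integers such as (document, sentence, word), and that a posting can be stored as the number of leading fields it shares with its predecessor plus the fields that differ. The tuple layout and the names omit_prefixes and restore_prefixes are hypothetical. Run-Length Encoding and n-s coding are not shown because the abstract gives no details about them.

```python
from typing import List, Tuple

# Assumed layout for illustration only: (document, sentence, word).
Posting = Tuple[int, ...]
Encoded = Tuple[int, Posting]  # (fields shared with previous posting, remaining fields)


def omit_prefixes(postings: List[Posting]) -> List[Encoded]:
    """Store each posting as (shared prefix length, differing suffix)."""
    out: List[Encoded] = []
    prev: Posting = ()
    for p in postings:
        shared = 0
        # Count the leading fields that match the previous posting.
        while shared < min(len(prev), len(p)) and prev[shared] == p[shared]:
            shared += 1
        out.append((shared, p[shared:]))
        prev = p
    return out


def restore_prefixes(encoded: List[Encoded]) -> List[Posting]:
    """Invert omit_prefixes by copying the shared fields from the previous posting."""
    out: List[Posting] = []
    prev: Posting = ()
    for shared, suffix in encoded:
        p = prev[:shared] + suffix
        out.append(p)
        prev = p
    return out


postings = [(1, 1, 4), (1, 1, 9), (1, 3, 2), (2, 1, 1)]
encoded = omit_prefixes(postings)
print(encoded)
```

In this sketch the second posting shares its document and sentence with the first, so only its word position is stored. Because the postings are sorted, long shared prefixes are common and the stored suffixes stay short.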