This paper describes a combination of compression methods that can reduce the size of inverted indexes for very large text databases. The methods are Prefix Omission, Run-Length Encoding, and a novel family of numeric representations called n-s coding. Applied to two text sources (the King James Version of the Bible and a sample of Wall Street Journal stories), they yield a compressed index that occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. The reduced I/O time can almost fully compensate for the time needed to decompress the postings. This research is part of an effort to handle very large text databases on the CM-5, a massively parallel MIMD supercomputer.