Exploiting clustering in inverted file compression

Authors:
A. Moffat;L. Stuiver
Affiliations:
-;-
Venue:
DCC '96 Proceedings of the Conference on Data Compression
Year:
1996

Citing 0
Cited 10

Compressed inverted files with reduced decoding overheads
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

Peer-to-peer based recommendations for mobile commerce
WMC '01 Proceedings of the 1st international workshop on Mobile commerce

Simple Bayesian Model for Bitmap Compression
Information Retrieval

A General Approach to Compression of Hierarchical Indexes
DEXA '01 Proceedings of the 12th International Conference on Database and Expert Systems Applications

Index Compression through Document Reordering
DCC '02 Proceedings of the Data Compression Conference

Compressing term positions in web indexes
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Engineering basic algorithms of an in-memory text search engine
ACM Transactions on Information Systems (TOIS)

Interpolative coding of integer sequences supporting log-time random access
Information Processing and Management: an International Journal

Quasi-succinct indices
Proceedings of the sixth ACM international conference on Web search and data mining

On the compression of search trees
Information Processing and Management: an International Journal

Abstract

Document databases contain large volumes of text, and currently have typical sizes in the gigabyte range. To query these text collections efficiently, some form of index is required, since without an index even the fastest pattern matching techniques result in unacceptable response times. One pervasive indexing method is the inverted file, also sometimes known as a concordance or postings file. A number of efforts have been made to capture the "clustering" effect, in which the documents containing a given term tend to occur close together in the collection rather than being spread uniformly, and to design index compression methods that condition their probability predictions according to context. In these methods, information as to whether or not the most recent (or second most recent, and so on) document contained term t is used to bias the prediction that the next document will contain term t. We further extend this notion of context-based index compression, and describe a surprisingly simple index representation that gives excellent performance on all of our test databases, allows fast decoding, and is, even in the worst case, only slightly inferior to Golomb (1966) coding.
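
As a point of reference for the Golomb baseline mentioned above, the following is a minimal sketch of Golomb coding applied to the gaps between successive document numbers in one term's inverted list. It is not the representation proposed in the paper, which the abstract does not specify. The function names `d_gaps`, `golomb_encode`, and `golomb_decode` are hypothetical, bit strings are used instead of packed bits for readability, and the parameter `b` is assumed to be chosen by the caller.

```python
def d_gaps(doc_ids):
    """Turn a strictly increasing list of document numbers (>= 1) into gaps."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def _binary(value, width):
    """Fixed-width binary string; empty when width is zero."""
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(x, b):
    """Golomb code of an integer x >= 1 with parameter b >= 1, as a bit string."""
    q, r = divmod(x - 1, b)
    bits = "1" * q + "0"          # quotient in unary
    k = (b - 1).bit_length()      # ceil(log2(b)) for b >= 1
    u = (1 << k) - b              # number of short (k - 1 bit) remainders
    if r < u:
        bits += _binary(r, k - 1)
    else:
        bits += _binary(r + u, k)
    return bits


def golomb_decode(bits, b, count):
    """Decode `count` Golomb-coded integers from a bit string."""
    k = (b - 1).bit_length()
    u = (1 << k) - b
    pos, out = 0, []
    for _ in range(count):
        q = 0
        while bits[pos] == "1":   # read the unary quotient
            q += 1
            pos += 1
        pos += 1                  # skip the terminating "0"
        r = 0
        if k > 0:
            if k > 1:
                r = int(bits[pos:pos + k - 1], 2)
                pos += k - 1
            if r >= u:            # long remainder: one more bit follows
                r = ((r << 1) | int(bits[pos])) - u
                pos += 1
        out.append(q * b + r + 1)
    return out


if __name__ == "__main__":
    # A clustered list: the term occurs in two tight runs of documents.
    doc_ids = [3, 4, 5, 9, 40, 41, 43]
    gaps = d_gaps(doc_ids)
    encoded = "".join(golomb_encode(g, 3) for g in gaps)
    assert golomb_decode(encoded, 3, len(gaps)) == gaps
    print(gaps, len(encoded), "bits")
```

A single Golomb parameter per list treats every gap as drawn from the same distribution, so runs of small gaps separated by an occasional large gap are exactly the situation in which a context-sensitive method has room to do better.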