Modeling word occurrences for the compression of concordances

Authors:
A. Bookstein;S. T. Klein;T. Raita
Affiliations:
Univ. of Chicago, Chicago, IL;Bar Ilan Univ., Ramat-Gan, Israel;Univ. of Turku, Turku, Finland
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
1997

Citing 13
Cited 6

Compression of index term dictionary in an inverted-file-orientated database: some effective algorithms

Information Processing and Management: an International Journal
Data compression for a source with Markov characteristics

The Computer Journal
Data compression using dynamic Markov modelling

The Computer Journal
Compression of concordances in full-text retrieval systems

SIGIR '88 Proceedings of the 11th annual international ACM SIGIR conference on Research and development in information retrieval
Data compression with finite windows

Communications of the ACM
Storing text retrieval systems on CD-ROM: compression and encryption considerations

ACM Transactions on Information Systems (TOIS)
Modeling for text compression

ACM Computing Surveys (CSUR)
A systematic approach to compressing a full-text retrieval system

Information Processing and Management: an International Journal - Special issue on data compression for images and texts
Compression of indexes with full positional information in very large text databases

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Document and passage retrieval based on hidden Markov models

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Improved hierarchical bit-vector compression in document retrieval systems

Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms

The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images

Simple Bayesian Model for Bitmap Compression

Information Retrieval
Binary Interpolative Coding for Effective Index Compression

Information Retrieval
Compressing term positions in web indexes

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Sorting out the document identifier assignment problem

ECIR'07 Proceedings of the 29th European conference on IR research
VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Engineering basic algorithms of an in-memory text search engine

ACM Transactions on Information Systems (TOIS)


Abstract

An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents and the one-bits represent the occurrence of given terms. Hidden Markov Models (HMMs) are used to describe the clustering of the one-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph-theoretic reduction and complementation operations are defined among the various models and are used to provide a structure relating the models studied. Finally, the new methods are tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Française and the Responsa Project.
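
To illustrate why modeling clustering can help, the following minimal sketch compares the ideal code length of a bitmap under an independence model with that under a simple first-order Markov chain whose state is the previous bit. It is not the paper's HMM or any of its n-state models; the function names, the virtual leading zero bit, and the choice to estimate probabilities from the bitmap itself (ignoring the cost of storing the model parameters) are all assumptions made for illustration.

```python
import math


def independent_cost(bits):
    """Ideal code length in bits when every bit is coded independently.

    Assumption: P(1) is estimated as the fraction of one-bits in the bitmap.
    """
    if not bits:
        return 0.0
    p_one = sum(bits) / len(bits)
    cost = 0.0
    for b in bits:
        prob = p_one if b else 1.0 - p_one
        cost -= math.log2(prob)
    return cost


def markov_cost(bits):
    """Ideal code length in bits under a two-state first-order Markov chain.

    Assumption: the state is the previous bit, with a virtual 0 bit before
    the start; transition probabilities are estimated from the bitmap.
    """
    # counts[prev][cur] = number of times bit `cur` follows bit `prev`
    counts = [[0, 0], [0, 0]]
    prev = 0
    for b in bits:
        counts[prev][b] += 1
        prev = b
    cost = 0.0
    for state in (0, 1):
        total = sum(counts[state])
        for cur in (0, 1):
            c = counts[state][cur]
            if c:
                cost -= c * math.log2(c / total)
    return cost


# A hypothetical bitmap in which the one-bits occur in two clusters.
clustered = [0] * 20 + [1] * 10 + [0] * 20 + [1] * 10 + [0] * 20
print(f"independent: {independent_cost(clustered):.1f} bits")
print(f"markov:      {markov_cost(clustered):.1f} bits")
```

On this clustered bitmap the Markov estimate is considerably smaller than the independent one, because long runs of identical bits become highly predictable once the previous bit is known. A bitmap with the same number of one-bits scattered without any run structure would show little or no gain.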