Modeling word occurrences for the compression of concordances

  • Authors:
  • A. Bookstein;P. Lankinen;C. W. Sze

  • Affiliations:
  • -;-;-

  • Venue:
  • DCC '95 Proceedings of the Conference on Data Compression
  • Year:
  • 1995

Quantified Score

Hi-index 0.00

Visualization

Abstract

Summary form only given. Effective compression of a text-based information retrieval system involves compression not only the text itself, but also of the concordance by which one accesses that text and which occupies an amount of storage comparable to the text itself. The concordance can be a rather complicated data structure, especially if it permits hierarchical access to the database. But one or more components of the hierarchy can usually be conceptualized as a bit-map. We conceptualize our bit-map as being generated as follows. At any bit-map site we are in one of two states: a cluster state (C), or a between-cluster state (B). In a given state, we generate a bit-map-value of zero or one and, governed by the transition probabilities of the model, enter a new state as we move to the next bit-map site. Such a model has been referred to as a hidden Markov model in the literature. Unfortunately, this model is analytically difficult to use. To approximate it, we introduce several traditional Markov models with four states each, B and C as above, and two transitional states. We present the models, show how they are connected, and state the formal compression algorithm based on these models. We also include some experimental results.