Modeling word occurrences for the compression of concordances

  • Authors:
  • A. Bookstein; S. T. Klein; T. Raita

  • Affiliations:
  • Univ. of Chicago, Chicago, IL; Bar Ilan Univ., Ramat-Gan, Israel; Univ. of Turku, Turku, Finland

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 1997

Abstract

An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents, and the one-bits represent the occurrence of given terms. Hidden Markov Models (HMMs) are used to describe the clustering of the one-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph-theoretic reduction and complementation operations are defined among the various models and are used to provide a structure relating the models studied. Finally, the new methods were tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Française and the Responsa Project.
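
To illustrate why modeling clustering can help, the following minimal sketch compares the ideal code length of a single term bitmap under an independence assumption with that under a simple two-state, first-order Markov chain in which the state is just the previous bit. This is an illustrative assumption, not one of the n-state models inventoried in the paper, and the function names (independent_cost, markov_cost) are hypothetical. The cost of transmitting the model parameters themselves is ignored.

```python
import math


def independent_cost(bits):
    """Ideal code length (in bits) when every bit is coded independently
    with the empirical one-bit probability of the bitmap."""
    p_one = sum(bits) / len(bits)
    cost = 0.0
    for b in bits:
        prob = p_one if b else 1.0 - p_one
        cost -= math.log2(prob)
    return cost


def markov_cost(bits):
    """Ideal code length (in bits) under a first-order, two-state Markov
    chain whose state is the previous bit (an assumption of this sketch).
    Transition probabilities are estimated from the bitmap itself."""
    # Count how often each (previous, current) pair occurs.
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for prev, cur in zip(bits, bits[1:]):
        counts[(prev, cur)] += 1

    cost = 1.0  # assumption: the first bit is sent verbatim
    for prev, cur in zip(bits, bits[1:]):
        total = counts[(prev, 0)] + counts[(prev, 1)]
        cost -= math.log2(counts[(prev, cur)] / total)
    return cost


# A hypothetical bitmap over 100 documents in which the term's
# occurrences are clustered in ten consecutive documents.
clustered = [0] * 40 + [1] * 10 + [0] * 50

print(f"independent: {independent_cost(clustered):.1f} bits")
print(f"markov:      {markov_cost(clustered):.1f} bits")
```

When the one-bits are clustered, as in this example, the Markov estimate is shorter because a one-bit is much more likely to follow another one-bit than the overall density would suggest.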