Storing text retrieval systems on CD-ROM: compression and encryption considerations

  • Authors:
  • Shmuel T. Klein;Abraham Bookstein;Scott Deerwester

  • Affiliations:
  • Univ. of Chicago, Chicago, IL;Univ. of Chicago, Chicago, IL;Univ. of Chicago, Chicago, IL

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 1989

Abstract

The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 megabytes long, more than a CD-ROM can hold. In addition, the dictionary and concordance needed to access these data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: Specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips.
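
The coin-flip claim at the end of the abstract can be illustrated with a small experiment. The sketch below is not taken from the paper: it assumes one particularly simple text model, in which symbols are drawn independently and every probability is a power of 1/2, and the alphabet, the probabilities, and the names huffman_code and encode_sample are all hypothetical. Under that assumption each Huffman output bit is 0 or 1 with equal probability, so the fraction of ones in a long encoded string should be close to one half.

```python
import heapq
import random
from itertools import count


def huffman_code(probs):
    """Build a Huffman code; returns {symbol: bitstring}."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Assumed toy model: independent symbols with dyadic probabilities.
PROBS = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
CODE = huffman_code(PROBS)


def encode_sample(n, seed=0):
    """Draw n symbols from PROBS and return their Huffman encoding."""
    rng = random.Random(seed)
    symbols = rng.choices(list(PROBS), weights=list(PROBS.values()), k=n)
    return "".join(CODE[s] for s in symbols)


bits = encode_sample(100_000)
ones_fraction = bits.count("1") / len(bits)
print(f"code: {CODE}")
print(f"fraction of ones: {ones_fraction:.4f}")
```

A balanced count of ones is only a necessary condition for resembling coin flips, not a proof of it, and real text does not satisfy the independence or power-of-1/2 assumptions made here.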