A text compression scheme that allows fast searching directly in the compressed file

  • Authors:
  • Udi Manber

  • Affiliations:
  • Univ. of Arizona, Tucson

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 1997

Quantified Score

Hi-index 0.00

Visualization

Abstract

A new text compression scheme is presented in this article. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified; only the outcome of the matching of the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and precessing time than a search in the original file. For typical text files, we achieve about 30% reduction of space and slightly less of search time. A 30% space saving is not competitive with good text compression schemes, and thus should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.