Word-Based Compression Methods for Large Text Documents

  • Authors:
  • Jiří Dvorský;Jaroslav Pokorný;Václav Snášel

  • Affiliations:
  • -;-;-

  • Venue:
  • DCC '99 Proceedings of the Conference on Data Compression
  • Year:
  • 1999

Abstract

In this article we present a new compression method, called WLZW, which is a word-based modification of the classic LZW algorithm. The modification is similar to the approach used in the Hu_Word [3] compression algorithm. Because WLZW is intended for use in text databases, some of its features appear preferable in comparison with similar previous approaches [1, 2]. The algorithm is two-phase; it uses only one table for both words and non-words (so-called tokens), and a single data structure for the lexicon, which is also usable as a text index. The length of words and non-words is restricted, which improves the compression ratio achieved. Tokens of unlimited length alternate as they are read from the input stream: a word is always followed by a non-word and vice versa. Restricting the token length, however, breaks this alternation, because some tokens are divided into several parts of the same type. To preserve alternation, two special tokens are introduced: the empty word and the empty non-word, which contain no characters. An empty word is inserted between two consecutive non-words, and an empty non-word between two consecutive words, so that alternation holds for every sequence of tokens. This alternation is an important piece of information, since the type of the next token can always be predicted from it. It also allows one selected non-word (the so-called victim) to be deleted from the input stream; an algorithm for choosing the victim is also presented. During the decompression phase, each deleted victim is recognized as a violation of the word/non-word alternation and reinserted.
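The tokenization, empty-token, and victim mechanisms described above can be sketched as follows. This is a minimal illustration in Python, not the paper's implementation: the alphanumeric word class, the 4-character length limit, and the choice of a single space as the victim non-word are all assumptions made for the example.

```python
import re

MAX_TOKEN_LEN = 4  # small limit for illustration; the paper restricts token length

def split_long(token):
    """Split a token into chunks of at most MAX_TOKEN_LEN characters."""
    return [token[i:i + MAX_TOKEN_LEN] for i in range(0, len(token), MAX_TOKEN_LEN)]

def tokenize(text):
    """Return a strictly alternating word/non-word token stream.

    A 'word' is a run of alphanumerics, a 'non-word' a run of anything else
    (an assumed token classification). Splitting long tokens breaks the
    alternation, so an empty word is inserted between two adjacent non-words
    and an empty non-word between two adjacent words. The result is a list of
    (kind, string) pairs, with kind in {'w', 'n'}.
    """
    raw = []
    for m in re.finditer(r'[A-Za-z0-9]+|[^A-Za-z0-9]+', text):
        kind = 'w' if m.group()[0].isalnum() else 'n'
        for part in split_long(m.group()):
            raw.append((kind, part))
    out = []
    for kind, tok in raw:
        if out and out[-1][0] == kind:          # alternation would break here
            out.append(('n' if kind == 'w' else 'w', ''))  # insert empty token
        out.append((kind, tok))
    return out

def delete_victim(stream, victim=' '):
    """Drop every occurrence of the victim non-word from the stream."""
    return [(k, t) for k, t in stream if not (k == 'n' and t == victim)]

def restore_victim(stream, victim=' '):
    """Reinsert the victim wherever two words are adjacent.

    After victim deletion, a word followed directly by another word is an
    alternation error, which is exactly how the decompressor detects the
    deleted victim.
    """
    out = []
    for kind, tok in stream:
        if out and out[-1][0] == 'w' and kind == 'w':
            out.append(('n', victim))
        out.append((kind, tok))
    return out
```

Note that the empty tokens are what make victim recovery unambiguous: split word parts are separated by an explicit empty non-word token, so the only word-word adjacencies the decompressor ever sees are the positions where the victim was removed.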