Word-Based Compression Methods for Large Text Documents

  • Authors:
  • Jiří Dvorský;Jaroslav Pokorný;Václav Snášel

  • Affiliations:
  • -;-;-

  • Venue:
  • DCC '99 Proceedings of the Conference on Data Compression
  • Year:
  • 1999

Abstract

In this article we present a new compression method, called WLZW, which is a word-based modification of the classic LZW algorithm. The modification is similar to the approach used in the Hu_Word [3] compression algorithm. Because WLZW is intended for use in text databases, some of its features appear preferable in comparison with similar previous approaches [1, 2]. The algorithm is two-phase; it uses only one table for both words and non-words (so-called tokens), and a single data structure for the lexicon, which is also usable as a text index. The length of words and non-words is restricted, which improves the compression ratio achieved. Tokens of unlimited length alternate as they are read from the input stream: a word is always followed by a non-word and vice versa. Restricting the token length, however, breaks this alternation, because some tokens are divided into several parts of the same type. To preserve alternation, two special tokens are introduced: the empty word and the empty non-word, which contain no characters. An empty word is inserted between two consecutive non-words, and an empty non-word between two consecutive words, so that alternation holds for every sequence of tokens. This alternation is an important piece of information, since the type of the next token can always be predicted from it. It also allows one selected non-word (the so-called victim) to be deleted from the input stream; an algorithm for choosing the victim is also presented. During the decompression phase, each deleted victim is recognized as a violation of the word/non-word alternation and reinserted.
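The tokenization, empty-token, and victim mechanisms described above can be sketched as follows. This is a minimal illustration in Python, not the paper's implementation: the alphanumeric word class, the 4-character length limit, and the choice of a single space as the victim non-word are all assumptions made for the example.

```python
import re

MAX_TOKEN_LEN = 4  # small limit for illustration; the paper restricts token length

def split_long(token):
    """Split a token into chunks of at most MAX_TOKEN_LEN characters."""
    return [token[i:i + MAX_TOKEN_LEN] for i in range(0, len(token), MAX_TOKEN_LEN)]

def tokenize(text):
    """Return a strictly alternating word/non-word token stream.

    A 'word' is a run of alphanumerics, a 'non-word' a run of anything else
    (an assumed token classification). Splitting long tokens breaks the
    alternation, so an empty word is inserted between two adjacent non-words
    and an empty non-word between two adjacent words. The result is a list of
    (kind, string) pairs, with kind in {'w', 'n'}.
    """
    raw = []
    for m in re.finditer(r'[A-Za-z0-9]+|[^A-Za-z0-9]+', text):
        kind = 'w' if m.group()[0].isalnum() else 'n'
        for part in split_long(m.group()):
            raw.append((kind, part))
    out = []
    for kind, tok in raw:
        if out and out[-1][0] == kind:          # alternation would break here
            out.append(('n' if kind == 'w' else 'w', ''))  # insert empty token
        out.append((kind, tok))
    return out

def delete_victim(stream, victim=' '):
    """Drop every occurrence of the victim non-word from the stream."""
    return [(k, t) for k, t in stream if not (k == 'n' and t == victim)]

def restore_victim(stream, victim=' '):
    """Reinsert the victim wherever two words are adjacent.

    After victim deletion, a word followed directly by another word is an
    alternation error, which is exactly how the decompressor detects the
    deleted victim.
    """
    out = []
    for kind, tok in stream:
        if out and out[-1][0] == 'w' and kind == 'w':
            out.append(('n', victim))
        out.append((kind, tok))
    return out
```

Note that the empty tokens are what make victim recovery unambiguous: split word parts are separated by an explicit empty non-word token, so the only word-word adjacencies the decompressor ever sees are the positions where the victim was removed.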