Word-based fixed and flexible list compression

Authors:
Ebru Celikel;Mehmet E. Dalkilic;Gokhan Dalkilic
Affiliations:
International Computer Institute, Ege University, Bornova, Izmir, Turkey;International Computer Institute, Ege University, Bornova, Izmir, Turkey;Computer Engineering Department, Dokuz Eylul University, Bornova, Izmir, Turkey
Venue:
ISCIS'05 Proceedings of the 20th international conference on Computer and Information Sciences
Year:
2005

Citing 3
Cited 0

A locally adaptive data compression scheme

Communications of the ACM
The data compression book (2nd ed.)

The data compression book (2nd ed.)
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a dictionary based lossless text compression scheme where we keep frequent words in separate lists (list_n contains words of length n). We pursued two alternatives in terms of the lengths of the lists. In the "fixed" approach all lists have equal number of words whereas in the "flexible" approach no such constraint is imposed. Results clearly show that the "flexible" scheme is much better in all test cases possibly due to the fact that it can accomodate short, medium or long word lists reflecting on the word length distributions of a particular language. Our approach encodes a word as a prefix (the length of the word) and the body of the word (as an index in the corresponding list). For prefix encoding we have employed both a static encoding and a dynamic encoding (Huffman) using the word length statistics of the source language. Dynamic prefix encoding clearly outperformed its static counterpart in all cases. A language with a higher average word length can, theoretically, benefit more from a word-list based compression approach as compared to one with a lower average word length. We have put this hypothesis to test using Turkish and English languages with average word lengths of 6.1 and 4.4, respectively. Our results strongly support the validity of this hypothesis.