Improving semistatic compression via pair-based coding

  • Authors:
  • Nieves R. Brisaboa;Antonio Fariña;Gonzalo Navarro;José R. Paramá

  • Affiliations:
  • Database Lab., Univ. da Coruña, Facultade de Informática, Spain;Database Lab., Univ. da Coruña, Facultade de Informática, Spain;Dept. of Computer Science, Univ. de Chile, Santiago, Chile;Database Lab., Univ. da Coruña, Facultade de Informática, Spain

  • Venue:
  • PSI'06 Proceedings of the 6th international Andrei Ershov memorial conference on Perspectives of systems informatics
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

In the last years, new semistatic word-based byte-oriented compressors, such as Plain and Tagged Huffman and the Dense Codes, have been used to improve the efficiency of text retrieval systems, while reducing the compressed collections to 30-35% of their original size. In this paper, we present a new semistatic compressor, called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27-28%, overcoming the optimal 0-order prefix-free semistatic compressor (Plain Huffman) in more than 3 percentage points. Moreover, PETDC permits also random decompression, and direct searches using fast Boyer-Moore algorithms. PETDC builds a vocabulary with both words and pairs of words. The basic idea in which PETDC is based is that, since each symbol in the vocabulary is given a codeword, compression is improved by replacing two words of the source text by a unique codeword.