Improving semistatic compression via pair-based coding

Authors:
Nieves R. Brisaboa;Antonio Fariña;Gonzalo Navarro;José R. Paramá
Affiliations:
Database Lab., Univ. da Coruña, Facultade de Informática, Spain;Database Lab., Univ. da Coruña, Facultade de Informática, Spain;Dept. of Computer Science, Univ. de Chile, Santiago, Chile;Database Lab., Univ. da Coruña, Facultade de Informática, Spain
Venue:
PSI'06 Proceedings of the 6th international Andrei Ershov memorial conference on Perspectives of systems informatics
Year:
2006

Abstract

In recent years, new semistatic word-based byte-oriented compressors, such as Plain and Tagged Huffman and the Dense Codes, have been used to improve the efficiency of text retrieval systems while reducing compressed collections to 30-35% of their original size. In this paper, we present a new semistatic compressor called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27-28% of their original size, outperforming the optimal 0-order prefix-free semistatic compressor (Plain Huffman) by more than 3 percentage points. Moreover, PETDC also permits random decompression and direct searches using fast Boyer-Moore algorithms. PETDC builds a vocabulary containing both words and pairs of words. The basic idea behind PETDC is that, since each symbol in the vocabulary is given a codeword, compression improves when two words of the source text are replaced by a single codeword.
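
The pair-based idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. The codeword function follows the usual End-Tagged Dense Code scheme (the last byte of each codeword has its high bit set, all earlier bytes have it clear). The pair-selection rule, by contrast, is an assumption made only for illustration: the abstract does not say how PETDC chooses pairs, so the hypothetical `min_pair_freq` threshold and the greedy left-to-right parse stand in for whatever heuristic the paper uses. The names `etdc_codeword`, `parse_with_pairs`, and `compress` are likewise hypothetical.

```python
from collections import Counter


def etdc_codeword(rank):
    """End-Tagged Dense Code for a 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte: tag bit set (values 128..255)
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)    # earlier bytes: tag bit clear (0..127)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def parse_with_pairs(words, pairs):
    """Greedy left-to-right parse: emit a pair symbol whenever the next
    two words form a selected pair, otherwise emit the single word."""
    symbols, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            symbols.append((words[i], words[i + 1]))
            i += 2
        else:
            symbols.append(words[i])
            i += 1
    return symbols


def compress(words, min_pair_freq=2):
    """Semistatic two-pass sketch: build the vocabulary, then encode.

    ASSUMPTION: pairs are selected by a plain frequency threshold
    (min_pair_freq); this is a placeholder, not PETDC's actual rule.
    """
    pair_counts = Counter(zip(words, words[1:]))
    pairs = {p for p, c in pair_counts.items() if c >= min_pair_freq}
    symbols = parse_with_pairs(words, pairs)
    freq = Counter(symbols)
    # More frequent symbols get smaller ranks, hence shorter codewords.
    ranked = sorted(freq, key=lambda s: (-freq[s], str(s)))
    code = {s: etdc_codeword(r) for r, s in enumerate(ranked)}
    return b"".join(code[s] for s in symbols), code


data, code = compress("to be or not to be".split())
print(len(data), code[("to", "be")])
```

In this toy input the pair ("to", "be") occurs twice and receives a one-byte codeword, so the six source words are encoded in four bytes instead of six. A real compressor would also have to store the vocabulary and weigh each pair's savings against the cost of enlarging it.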