Improving semistatic compression via phrase-based modeling

Authors:
Nieves R. Brisaboa;Antonio Fariña;Gonzalo Navarro;José R. Paramá
Affiliations:
Database Lab, Facultade de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Database Lab, Facultade de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile;Database Lab, Facultade de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain
Venue:
Information Processing and Management: an International Journal
Year:
2011

Abstract

In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to search compressed text directly and quickly, and to decompress arbitrary text passages, over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and of a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step further by using phrases as source symbols. We present two new semistatic modelers that we combine with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), where source symbols can be words or sequences of words (phrases). PETDC compresses English texts to 28-29% of their original size and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms.
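
As a rough illustration of the dense coding step mentioned above, the following Python sketch assigns End-Tagged Dense Code codewords to symbols ranked by frequency: the 128 most frequent symbols get one byte, the next 128^2 get two bytes, and so on, with the high bit set only on the last byte of each codeword. The abstract does not describe how PETDC or PhETDC select pairs or phrases, so the example simply assumes the text has already been split into symbols (a word, or a tuple of words standing for a pair). The function names `etdc_encode`, `etdc_decode`, and `build_codebook` are hypothetical and chosen for this sketch only.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Codeword for the symbol with the given 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte carries the end tag (high bit set)
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense numbering: no codeword value is wasted
        out.append(rank % 128)    # continuation bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(data: bytes) -> list[int]:
    """Recover the sequence of ranks from a concatenation of codewords."""
    ranks, acc = [], 0
    for b in data:
        if b < 128:               # continuation byte
            acc = acc * 128 + b + 1
        else:                     # tagged byte ends the codeword
            ranks.append(acc * 128 + (b - 128))
            acc = 0
    return ranks


def build_codebook(symbols):
    """Rank distinct symbols by decreasing frequency and assign codewords."""
    ordered = [s for s, _ in Counter(symbols).most_common()]
    return ordered, {s: etdc_encode(i) for i, s in enumerate(ordered)}


# Assumed input: a pre-tokenized text where ("to", "be") was chosen as a pair.
symbols = [("to", "be"), "or", "not", ("to", "be")]
vocabulary, codebook = build_codebook(symbols)
compressed = b"".join(codebook[s] for s in symbols)
restored = [vocabulary[r] for r in etdc_decode(compressed)]
```

Because the tag bit marks where each codeword ends, a decoder or a byte-wise pattern matcher can find codeword boundaries from any position in the compressed stream, which is what makes random access and direct searching plausible for this family of codes.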