Universal Text Preprocessing for Data Compression

Authors: Jurgen Abel; William Teahan
Affiliations: IEEE; IEEE
Venue: IEEE Transactions on Computers
Year: 2005

Abstract

Several complementary preprocessing algorithms for text files are presented; they are applied before the compression scheme itself. The algorithms need no external dictionary and are language independent. The compression gain is compared, together with the cost in speed, for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.