Mapping words into codewords on PPM

Authors:
Joaquín Adiego;Pablo de la Fuente
Affiliations:
Depto. de Informática, Universidad de Valladolid, Valladolid, Spain;Depto. de Informática, Universidad de Valladolid, Valladolid, Spain
Venue:
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Year:
2006

Abstract

We describe a simple and efficient scheme that allows PPM modelling to work with words when a natural language text file is compressed. The main idea is to assign each word a code, which makes words easier to manipulate. We achieve this with a general technique: a dictionary mapping on PPM modelling. To test the idea, we are implementing three prototypes: the first implements the basic dictionary mapping on PPM, the second combines the dictionary mapping with the separate alphabets model, and the third combines it with the spaceless words model. The technique can be applied directly or combined with a word compression model. For files of 1 MB and larger, the results are better than those of the character-based PPM that was taken as the baseline. A comparison of the prototypes shows that the best option is a word-based PPM used in conjunction with the spaceless words concept.
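The dictionary mapping combined with the spaceless words model can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the function names `encode_spaceless` and `decode_spaceless`, and the use of plain integers as codes are all assumptions made for illustration, and the abstract does not say how new dictionary entries are transmitted to the decoder. The sketch only shows the mapping step, in which each distinct word or separator receives a code and a single space between two words is left implicit. The resulting code sequence is what a PPM model would then take as its input symbols.

```python
import re

# Hypothetical tokenizer: maximal runs of word characters are words,
# maximal runs of anything else are separators, so tokens alternate.
TOKEN_RE = re.compile(r"\w+|\W+")


def is_word(token):
    """True when the token consists only of word characters."""
    return re.fullmatch(r"\w+", token) is not None


def encode_spaceless(text):
    """Map tokens to integer codes, omitting a single space between two words.

    Returns the code sequence and the token -> code dictionary.
    Words and separators share one dictionary (one alphabet).
    """
    dictionary = {}
    codes = []
    tokens = TOKEN_RE.findall(text)
    for i, tok in enumerate(tokens):
        # Tokens alternate, so an inner " " always sits between two words
        # and can be left implicit.
        if tok == " " and 0 < i < len(tokens) - 1:
            continue
        if tok not in dictionary:
            dictionary[tok] = len(dictionary)  # next unused code
        codes.append(dictionary[tok])
    return codes, dictionary


def decode_spaceless(codes, dictionary):
    """Invert encode_spaceless, restoring the implicit single spaces."""
    inverse = {code: tok for tok, code in dictionary.items()}
    out = []
    prev_was_word = False
    for code in codes:
        tok = inverse[code]
        # Two consecutive words mean a single space was omitted.
        if prev_was_word and is_word(tok):
            out.append(" ")
        out.append(tok)
        prev_was_word = is_word(tok)
    return "".join(out)


if __name__ == "__main__":
    sample = "to be or not to be,  that is"
    codes, dictionary = encode_spaceless(sample)
    print(codes)
    print(decode_spaceless(codes, dictionary) == sample)
```

In this sketch the repeated words "to" and "be" reuse their codes, and only the separator ",  " is encoded explicitly because it is not a single space. A separate alphabets variant would instead keep one dictionary for words and another for separators and encode every token.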