Natural Language Compression on Edge-Guided Text Preprocessing

  • Authors:
  • Miguel A. Martínez-Prieto; Joaquín Adiego; Pablo de la Fuente

  • Affiliations:
  • Department of Computer Science, University of Valladolid, E.T.S. de Ingeniería Informática, Campus Miguel Delibes, 47011 Valladolid, Spain (all authors); M. A. Martínez-Prieto also with Department of Computer Science, University ...

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2011


Abstract

This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression purposes. It transforms the original text into a word net, which stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net has a text representation which reflects the natural word order in the text, so it can be used directly for encoding purposes. A specific coding scheme is described on top of the word net. It regards a text as a sequence of word transitions, in such a way that each word is encoded by traversing a specific edge from the vertex which stores the preceding word. This accomplishes a 1-order text preprocessing whose output is an intermediate byte representation that can be effectively encoded with universal techniques. This technique, called E-G1, supports several variants. This experience is then used to revisit the concept of the word net and to identify significant 2-word symbols by performing a specific transformation on frequent edges. The resulting transformed word net appends these 2-word symbols to the original word vocabulary and establishes a hierarchical relationship between them and their component words. The transformed approach also enhances the original coding scheme to handle these new features. The new technique, called E-G2, approximates a 2-order model on words and likewise supports specific variants. Both techniques are studied from empirical and experimental perspectives. Several compressors are also used to analyze the preprocessing ability of E-G with respect to different compression approaches. Competitive space/time trade-offs are achieved when our approaches are used to compress medium- to large-sized texts. The best results are achieved when E-G preprocessing is coupled with high-order compressors such as Prediction by Partial Matching (PPM).
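
To make the abstract's description concrete, the sketch below shows, in Python, one way a word net and a 1-order edge encoding could be realized: vertices are words, edges count transitions between adjoining words, and each word is emitted as the rank of the edge leaving the preceding word's vertex. It is only an illustrative approximation of the E-G1 idea, not the authors' implementation; the names build_word_net and encode_first_order, the frequency-rank coding, and the integer (rather than byte-oriented) output are assumptions made for this example.

    # Illustrative sketch (not the paper's implementation) of a word net and
    # a 1-order encoding: each word is replaced by the rank of the edge that
    # leads to it from the preceding word's vertex.
    from collections import defaultdict

    def build_word_net(words):
        """Directed graph: vertices are words, edge weights count transitions
        between adjoining words."""
        edges = defaultdict(lambda: defaultdict(int))
        for prev, curr in zip(words, words[1:]):
            edges[prev][curr] += 1
        return edges

    def encode_first_order(words, edges):
        """Encode each word (after the first) as the frequency rank of the
        edge (prev -> word) among the edges leaving prev; rank 0 is the most
        frequent transition."""
        codes = []
        for prev, curr in zip(words, words[1:]):
            ranked = sorted(edges[prev], key=lambda w: -edges[prev][w])
            codes.append(ranked.index(curr))
        return codes

    if __name__ == "__main__":
        text = "the cat sat on the mat and the cat ran".split()
        net = build_word_net(text)
        print(encode_first_order(text, net))  # e.g. [0, 0, 0, 0, 1, 0, 0, 1]

Because frequent transitions receive small ranks, the resulting sequence is highly skewed, which is the property that lets the intermediate representation be compressed effectively by the universal back-end compressors mentioned in the abstract.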