This paper presents Edge-Guided (E-G), an optimized text preprocessing technique for compression purposes. It transforms the original text into a word net, which stores all relationships between adjoining words. A specific directed graph is proposed to model this transformation: words are stored in vertices, whereas edges represent word transitions. Thus, the word net is a text representation that reflects the natural word order in the text, so it can be used directly for encoding purposes. A specific coding scheme is described on top of the word net. It regards a text as a sequence of word transitions, in such a way that each word is encoded by traversing a specific edge from the vertex which stores the preceding word. This accomplishes an order-1 text preprocessing whose output is an intermediate byte representation that can be effectively encoded with universal techniques. This technique is called E-G1 and comes in several variants. This experience is used to revisit the concept of word net: it is used to identify significant 2-word symbols by performing a specific transformation on frequent edges. The resulting transformed word net appends these 2-word symbols to the original word vocabulary, and allows a specific hierarchical relationship between them and their component words. The transformed approach also enhances the original coding scheme to handle these new features. A new technique, called E-G2, approximates an order-2 model on words and also supports specific variants. Both techniques are studied from empirical and experimental perspectives. Several compressors are also used to analyze the preprocessing ability of E-G with respect to different compression approaches. Competitive space/time trade-offs are achieved when our approaches are used to compress medium to large texts. The best results are achieved when E-G preprocessing is coupled with high-order compressors such as Prediction by Partial Matching (PPM).
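The order-1 idea of encoding each word as an edge traversal from the preceding word's vertex can be illustrated with a minimal Python sketch. This is not the coding scheme of E-G1 itself, whose details the abstract does not give. It assumes edges are numbered per vertex in order of first appearance, uses a hypothetical escape code NEW_EDGE for transitions not yet in the graph, and emits plain integers rather than the byte representation the abstract mentions. The names encode, decode and NEW_EDGE are illustrative only.

```python
from collections import defaultdict

NEW_EDGE = 0  # hypothetical escape code: this transition is not in the word net yet


def encode(words):
    """Encode a word sequence as edge traversals in a word net built on the fly.

    Assumption of this sketch: outgoing edges of a vertex are numbered
    1, 2, 3, ... in order of first appearance.
    """
    edges = defaultdict(dict)  # vertex (word) -> {following word: edge number}
    vocab = {}                 # word -> id, in order of first appearance
    codes = []
    prev = None                # start vertex that precedes the first word
    for word in words:
        followers = edges[prev]
        if word in followers:
            # Known transition: emit only the edge number.
            codes.append(followers[word])
        else:
            # Unknown transition: escape, then the target word's vocabulary id.
            if word not in vocab:
                vocab[word] = len(vocab)
            codes.append(NEW_EDGE)
            codes.append(vocab[word])
            followers[word] = len(followers) + 1
        prev = word
    return codes, list(vocab)


def decode(codes, vocabulary):
    """Rebuild the word sequence by replaying the same edge numbering."""
    edges = defaultdict(list)  # vertex -> following words; edge k is at index k - 1
    words = []
    prev = None
    stream = iter(codes)
    for code in stream:
        if code == NEW_EDGE:
            word = vocabulary[next(stream)]
            edges[prev].append(word)
        else:
            word = edges[prev][code - 1]
        words.append(word)
        prev = word
    return words
```

On repetitive text, most words after the first occurrences collapse to small edge numbers, which is the kind of skewed output a universal compressor or PPM can exploit. The sketch sends the vocabulary alongside the codes, which is a simplification.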