Using structural contexts to compress semistructured text collections

Authors:
Joaquín Adiego; Gonzalo Navarro; Pablo de la Fuente
Affiliations:
Depto. de Informática, Universidad de Valladolid, ETIyT - Campus Miguel Delibes, Camino del Cementerio s/n, 47011 Valladolid, Valladolid, Spain; Depto. de Ciencias de la Computación, Universidad de Chile, Santiago, Chile; Depto. de Informática, Universidad de Valladolid, ETIyT - Campus Miguel Delibes, Camino del Cementerio s/n, 47011 Valladolid, Valladolid, Spain
Venue:
Information Processing and Management: an International Journal
Year:
2007

Abstract

We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each structure type (e.g., each distinct XML tag). The intuition behind SCM is that the texts belonging to a given structure type should follow a similar distribution, one that differs from that of other structure types. We focus mainly on semistatic models and test our idea using a word-based Huffman method. This is the standard for compressing large natural language text databases, because it allows random access, partial decompression, and direct search of the compressed collection. The resulting variant, dubbed SCMHuff, retains those features and improves on Huffman's compression ratios. Because storing separate models may not pay off when the distributions of different structure types are not different enough, we present a heuristic that merges models with the aim of minimizing the total size of the compressed database. This gives an additional improvement over the plain technique. A comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCMHuff achieves the best compression ratios, 2-4% better than the closest alternative. From a purely compression-oriented perspective, we also combine SCM with PPM modeling: a separate PPM model is used to compress the text that lies inside each structure type. The result, SCMPPM, permits neither random access nor direct search in the compressed text, but it gives 2-5% better compression ratios than other techniques for texts longer than 5 MB.
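The idea of one word-based Huffman model per structure type, plus merging of models that do not pay for themselves, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`words_by_tag`, `cost_bits`, `greedy_merge`), the toy tokenizer, the vocabulary cost of 8 bits per character, and the greedy pairwise merge are all assumptions made for illustration; the abstract does not specify how the paper's heuristic or cost model works.

```python
import heapq
import re
from collections import Counter, defaultdict


def huffman_code_lengths(freqs):
    """Return {word: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {word: 1 for word in freqs}
    # Heap entries are (weight, unique tiebreak, words in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for word in s1 + s2:  # every word under the new node gets one bit deeper
            lengths[word] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def words_by_tag(xml_text):
    """Count words per innermost enclosing tag (toy parser, no attributes)."""
    groups = defaultdict(Counter)
    stack = ["#root"]
    for close, tag, text in re.findall(r"<(/?)(\w+)>|([^<]+)", xml_text):
        if tag:
            if close:
                stack.pop()
            else:
                stack.append(tag)
        else:
            words = text.split()
            if words:
                groups[stack[-1]].update(words)
    return dict(groups)


def cost_bits(freqs):
    """Assumed cost: Huffman-coded text plus a plain-text vocabulary."""
    lengths = huffman_code_lengths(freqs)
    body = sum(freqs[w] * lengths[w] for w in freqs)
    vocabulary = sum(8 * (len(w) + 1) for w in freqs)  # assumption: 1 byte/char
    return body + vocabulary


def greedy_merge(groups):
    """Hypothetical heuristic: merge the best pair while total cost drops."""
    models = {frozenset([tag]): counts for tag, counts in groups.items()}
    while True:
        best = None
        keys = sorted(models, key=sorted)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                merged = models[a] + models[b]
                gain = cost_bits(models[a]) + cost_bits(models[b]) - cost_bits(merged)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, a, b)
        if best is None:
            return models
        _, a, b = best
        models[a | b] = models.pop(a) + models.pop(b)


if __name__ == "__main__":
    doc = "<doc><title>a b</title><body>b c c</body><note>c c b</note></doc>"
    groups = words_by_tag(doc)
    merged = greedy_merge(groups)
    print("separate:", sum(cost_bits(c) for c in groups.values()), "bits")
    print("merged:  ", sum(cost_bits(c) for c in merged.values()), "bits")
```

Under this assumed cost model, whether separate or merged models win depends on how much the per-tag word distributions differ relative to the cost of storing each vocabulary, which is the trade-off the abstract describes. A real word-based Huffman compressor would also have to encode separators and the code tables themselves, which this sketch ignores.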