We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., each different XML tag). The intuition behind SCM is that the texts belonging to a given structure type should have similar distributions, and that these should differ from the distributions of other structure types. We mainly focus on semistatic models, and test our idea using a word-based Huffman method. Word-based Huffman coding is the standard for compressing large natural language text databases, because it permits random access, partial decompression, and direct search of the compressed collection. The resulting variant, dubbed SCMHuff, retains those features and improves on the compression ratios of plain word-based Huffman coding. We consider the possibility that storing separate models may not pay off if the distributions of different structure types are not sufficiently distinct, and present a heuristic that merges models with the aim of minimizing the total size of the compressed database. This gives an additional improvement over the plain technique. The comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCMHuff achieves the best compression ratios, 2-4% better than the closest alternative. From a purely compression-aimed perspective, we also combine SCM with PPM modeling: a separate PPM model is used to compress the text that lies inside each different structure type. The result, SCMPPM, permits neither random access nor direct search in the compressed text, but it gives 2-5% better compression ratios than other techniques for texts longer than 5 MB.
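The trade-off described above, separate per-tag models versus the cost of storing them, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the model-storage cost (each vocabulary word stored as its UTF-8 bytes plus one terminator byte), and the greedy pairwise merge are all assumptions made for illustration; the abstract does not specify the actual heuristic.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {word: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {word: 1 for word in freqs}  # a lone symbol still needs one bit
    tie = count()  # unique tie-breaker so the heap never compares the lists
    heap = [(f, next(tie), [word]) for word, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, words1 = heapq.heappop(heap)
        f2, _, words2 = heapq.heappop(heap)
        for word in words1 + words2:
            lengths[word] += 1  # every merge adds one bit to these codes
        heapq.heappush(heap, (f1 + f2, next(tie), words1 + words2))
    return lengths


def encoded_bits(freqs):
    """Bits needed to encode the text under its own semistatic Huffman model."""
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[word] * lengths[word] for word in freqs)


def model_bits(freqs):
    """ASSUMED storage cost of a model: each word's bytes plus a terminator."""
    return sum(8 * (len(word.encode("utf-8")) + 1) for word in freqs)


def cost_bits(freqs):
    """Total cost of one model: encoded text plus the stored vocabulary."""
    return encoded_bits(freqs) + model_bits(freqs)


def scm_models(segments):
    """Build one word-frequency model per structure type (e.g., XML tag)."""
    models = {}
    for tag, text in segments:
        models.setdefault(tag, Counter()).update(text.split())
    return models


def merge_greedily(models):
    """HYPOTHETICAL heuristic: repeatedly merge the pair of models whose
    union saves the most bits, until no merge reduces the total cost."""
    groups = [({tag}, freqs) for tag, freqs in models.items()]
    while True:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                merged = groups[i][1] + groups[j][1]
                saving = (cost_bits(groups[i][1]) + cost_bits(groups[j][1])
                          - cost_bits(merged))
                if saving > 0 and (best is None or saving > best[0]):
                    best = (saving, i, j, merged)
        if best is None:
            return groups
        _, i, j, merged = best
        tags = groups[i][0] | groups[j][0]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append((tags, merged))
```

Calling `merge_greedily(scm_models(segments))` on a list of `(tag, text)` pairs returns groups of tags that share a model. Under the assumed cost, tags with near-identical vocabularies tend to be merged, because the shared vocabulary is stored only once, while tags with disjoint vocabularies stay separate, because merging would lengthen their codewords without saving any model space.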