Optimizing XML Compression

Authors:
Gregory Leighton;Denilson Barbosa
Affiliations:
University of Alberta, Edmonton, Canada;University of Alberta, Edmonton, Canada
Venue:
XSym '09 Proceedings of the 6th International XML Database Symposium on Database and XML Technologies
Year:
2009

Abstract

The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all opening and closing markup tags be present and properly balanced) also yields one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate the XML structure from the document content, and then compress each independently. Further compression gains can be realized by identifying highly similar document content and compressing it together, thereby amortizing the storage cost of the auxiliary information required by the chosen compression algorithm. Additionally, the choice of compression algorithm affects not only the achievable compression gain, but also access performance. Hence, choosing a compression configuration that optimizes compression gain requires one to determine (1) a partitioning strategy for document content, and (2) the best available compression algorithm to apply to each set within this partition. In this paper, we show that finding an optimal compression configuration with respect to compression gain is an NP-hard optimization problem. This problem remains intractable even if one considers a single compression algorithm for all content. We also describe an approximation algorithm, based on the branch-and-bound paradigm, for selecting a partitioning strategy for document content.
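To make the notion of a compression configuration concrete, the following is a minimal sketch of an exhaustive search over configurations. It is not the algorithm described in the paper. It assumes that document content has already been split into named containers (for example, one per element path), and it uses Python's standard `zlib`, `bz2`, and `lzma` modules as stand-in candidate algorithms. The names `COMPRESSORS`, `set_partitions`, `best_for_block`, and `best_configuration` are hypothetical and introduced only for illustration.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Iterator, List, Tuple

# Assumed candidate algorithms; the abstract does not name any.
COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def set_partitions(items: List[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Either `first` forms a block of its own ...
        yield [[first]] + partition
        # ... or it joins one of the existing blocks.
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]


def best_for_block(block: List[str],
                   containers: Dict[str, bytes]) -> Tuple[str, int]:
    """Return (algorithm name, compressed size) of the smallest encoding
    of the block's containers compressed together."""
    # Simplification: containers are concatenated without delimiters, so this
    # measures size only and is not a decodable storage format.
    data = b"".join(containers[name] for name in block)
    return min(((alg, len(fn(data))) for alg, fn in COMPRESSORS.items()),
               key=lambda pair: pair[1])


def best_configuration(
    containers: Dict[str, bytes],
) -> Tuple[int, List[Tuple[List[str], str]]]:
    """Try every partition and return (total size, [(block, algorithm), ...])."""
    best_cost = None
    best_config: List[Tuple[List[str], str]] = []
    for partition in set_partitions(sorted(containers)):
        choices = [(block, *best_for_block(block, containers))
                   for block in partition]
        cost = sum(size for _, _, size in choices)
        if best_cost is None or cost < best_cost:
            best_cost = cost
            best_config = [(block, alg) for block, alg, _ in choices]
    return best_cost, best_config


if __name__ == "__main__":
    demo = {
        "author": b"Leighton;Barbosa;" * 50,
        "title": b"Optimizing XML Compression;" * 50,
        "year": b"2009;2008;2005;" * 50,
    }
    print(best_configuration(demo))
```

The number of partitions of n containers is the n-th Bell number (5 for n = 3, 15 for n = 4), so this enumeration is only feasible for a handful of containers. That growth is consistent with the hardness result stated above and is the reason a pruning strategy such as branch-and-bound is of interest.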