Compressing XML with Multiplexed Hierarchical PPM Models

  • Authors: James Cheney
  • Venue: DCC '01 Proceedings of the Data Compression Conference
  • Year: 2001

Abstract

Extensible Markup Language (XML) is a standardized language that "describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them." According to Bosak and Bray, XML is the "next big thing" after HTML. It is gaining momentum in many areas of the computer industry; for example, Microsoft has announced plans to base future software systems on XML. XML is not a specific markup language like HTML, but instead a meta-language for describing markup languages together with a strong standard for creating and parsing documents.

Whatever XML's advantages, it has one glaring disadvantage: document size. XML's parent standard, Standard Generalized Markup Language (SGML), made many provisions for minimizing document size. These options made SGML complex and difficult to implement, and they were omitted from XML. Indeed, the XML standard explicitly states that markup terseness was not a design goal. Consequently, XML is easy to process but verbose. XML documents can be many times larger than equivalent non-standardized text or binary formats, even if compressed. There is growing concern in the XML community that inefficiency arising from document size will hinder adoption and use of XML.

From an information-theoretic point of view, this situation is vexing because two sources carrying the same messages should have the same entropy and so compress to about the same size using any universal compressor. The problem is that XML documents may have nonlocal redundancy arising from XML's tree structure, which is difficult for text compressors to discover. Conversely, XML-conscious compression techniques might be able to compress XML documents as well as or better than other representations. Fortuitously, XML's simple design makes testing this hypothesis easier than in previous structured-data compression approaches such as syntax-based compression of program files or machine code compression.

Liefke and Suciu describe XMILL, an XML compressor that transforms documents to expose redundancy, then applies standard text compressors. XMILL combined with gzip compresses XML data about 10% better than gzip on equivalent non-XML forms; further improvement (up to 50%) is possible with user assistance in the form of complex command-line parameters. This work shows that XML-conscious compression can do better than text compression alone. However, XMILL's base transformation has several drawbacks: it precludes incremental (online) processing of compressed documents, it actually hinders compressors other than gzip, and it requires user assistance to achieve the best compression.

Figure 1: XML example

In this paper, we describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using several text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called Encoded SAX (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) called Multiplexed Hierarchical Modeling (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail.
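
To make the idea of an online binary encoding of SAX events more concrete, the following is a minimal sketch in Python. It is not the ESAX format, whose details are not given in this abstract; the opcode values, the `ToyEventEncoder` class, and the `encode` helper are all hypothetical and chosen only for illustration. The sketch assumes that each element name is written out once and thereafter replaced by a one-byte index, so the repeated tag names that make XML verbose disappear from the stream before any general-purpose compressor sees it.

```python
import xml.sax


class ToyEventEncoder(xml.sax.ContentHandler):
    """Hypothetical encoder: one opcode byte per SAX event, names sent once.

    Assumptions of this sketch (not taken from the paper): at most 256
    distinct element names, strings shorter than 65536 bytes, attributes
    ignored, whitespace-only text dropped.
    """

    START, END, TEXT, NEW_NAME = 1, 2, 3, 4

    def __init__(self):
        super().__init__()
        self.out = bytearray()
        self.names = {}  # element name -> small integer id

    def _put_string(self, s):
        # Length-prefixed UTF-8 string (2-byte big-endian length).
        data = s.encode("utf-8")
        self.out += len(data).to_bytes(2, "big") + data

    def startElement(self, name, attrs):
        if name not in self.names:
            # First occurrence: announce the name and assign the next id.
            self.names[name] = len(self.names)
            self.out.append(self.NEW_NAME)
            self._put_string(name)
        self.out.append(self.START)
        self.out.append(self.names[name])

    def endElement(self, name):
        # The closing tag needs no name: nesting determines which element ends.
        self.out.append(self.END)

    def characters(self, content):
        if content.strip():
            self.out.append(self.TEXT)
            self._put_string(content)


def encode(xml_text):
    """Encode an XML string into the toy event byte stream."""
    handler = ToyEventEncoder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return bytes(handler.out)


if __name__ == "__main__":
    doc = "<catalog><entry>x</entry><entry>y</entry><entry>z</entry></catalog>"
    encoded = encode(doc)
    print(len(doc), "characters of XML ->", len(encoded), "bytes of events")
```

Because the encoder emits bytes as each event arrives, it works online, which is the property the abstract contrasts with XMILL's base transformation. A matching decoder would rebuild the name table from the `NEW_NAME` records in the same order.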