Effective asymmetric XML compression

  • Authors:
  • Przemysław Skibiński;Szymon Grabowski;Jakub Swacha

  • Affiliations:
  • Institute of Computer Science, University of Wrocław, Joliot-Curie 15, 50-383 Wrocław, Poland;Computer Engineering Department, Technical University of Łódź, Politechniki 11, 90-924 Łódź, Poland;Institute of Information Technology in Management, Szczecin University, Mickiewicza 64, 71-101 Szczecin, Poland

  • Venue:
  • Software—Practice & Experience
  • Year:
  • 2008

Abstract

The innate verbosity of the extensible markup language (XML) remains one of its main weaknesses, especially when large documents are concerned. This problem can be solved with the aid of dedicated XML compression algorithms. In this work, we describe the XML word-replacing transform (XML-WRT), a fast and fully reversible XML transform which, when combined with commonly used LZ77-style compression algorithms, attains high compression ratios comparable to those achieved by the current state-of-the-art XML compressors. The resulting compression scheme is asymmetric in the sense that its decoder is much faster than its coder. This is a desirable practical property, as in many XML applications data are read much more often than written. The key features of the transform are dictionary-based encoding of both document structure and content, separation of different content types into multiple streams, and dedicated encoding of specific patterns, including numbers and dates. The test results show that the proposed transform improves the XML compression efficiency of general-purpose compressors on average by 35% in the case of gzip and 17% in the case of LZMA. Compared with the current state-of-the-art SCMPPM algorithm, XML-WRT with LZMA attains a compression ratio over 2% better, while being 55% faster. Copyright © 2007 John Wiley & Sons, Ltd.
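
The general idea of a word-replacing transform followed by an LZ77-style compressor can be illustrated with a minimal Python sketch. This is not the XML-WRT algorithm itself: the tokenizer, the use of Unicode private-use code points as dictionary codes, the payload layout, and the names wrt_encode, wrt_decode, pack and unpack are all assumptions made for illustration. The sketch also omits the multiple output streams and the dedicated number and date encodings mentioned in the abstract. Python's zlib module (DEFLATE, which is LZ77-based) stands in for the back-end compressor.

```python
import re
import zlib
from collections import Counter

# Hypothetical tokenizer: tag starts such as "<item" or "</item",
# and plain words of three or more letters.
TOKEN = re.compile(r"</?[A-Za-z_][\w.-]*|[A-Za-z]{3,}")

# Dictionary codes are drawn from the Unicode private-use area (an assumption
# of this sketch), which gives room for 6400 entries.
CODE_FIRST, CODE_LAST = 0xE000, 0xF8FF


def wrt_encode(xml, min_count=2):
    """Replace frequent tokens with one-character dictionary codes."""
    if any(CODE_FIRST <= ord(c) <= CODE_LAST for c in xml):
        raise ValueError("input already uses the code range")
    counts = Counter(TOKEN.findall(xml))
    limit = CODE_LAST - CODE_FIRST + 1
    words = [w for w, n in counts.most_common(limit) if n >= min_count]
    code = {w: chr(CODE_FIRST + i) for i, w in enumerate(words)}
    # Tokens missing from the dictionary are copied through unchanged.
    body = TOKEN.sub(lambda m: code.get(m.group(0), m.group(0)), xml)
    return words, body


def wrt_decode(words, body):
    """Invert wrt_encode by expanding every dictionary code."""
    return "".join(
        words[ord(c) - CODE_FIRST] if CODE_FIRST <= ord(c) <= CODE_LAST else c
        for c in body
    )


def pack(xml):
    """Transform, then compress dictionary and body with DEFLATE."""
    words, body = wrt_encode(xml)
    payload = "\n".join(words) + "\0" + body
    return zlib.compress(payload.encode("utf-8"), 9)


def unpack(blob):
    """Decompress, then undo the transform."""
    header, body = zlib.decompress(blob).decode("utf-8").split("\0", 1)
    words = header.split("\n") if header else []
    return wrt_decode(words, body)
```

Calling unpack(pack(document)) should return the original string unchanged, which is the reversibility property the abstract requires of the transform. The sketch also shows where the asymmetry comes from: the encoder has to tokenize the input and count token frequencies to build a dictionary, while the decoder only performs table lookups.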