XCQ: A queriable XML compression system

  • Authors:
  • Wilfred Ng;Wai-Yeung Lam;Peter T. Wood;Mark Levene

  • Affiliations:
  • Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China;Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China;School of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London, Hong Kong, UK;School of Computer Science and Information Systems, Birkbeck, University of London, Malet Street, London, Hong Kong, UK

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

XML has already become the de facto standard for specifying and exchanging data on the Web. However, XML is by nature verbose and thus XML documents are usually large in size, a factor that hinders its practical usage, since it substantially increases the costs of storing, processing, and exchanging data. In order to tackle this problem, many XML-specific compression systems, such as XMill, XGrind, XMLPPM, and Millau, have recently been proposed. However, these systems usually suffer from the following two inadequacies: They either sacrifice performance in terms of compression ratio and execution time in order to support a limited range of queries, or perform full decompression prior to processing queries over compressed documents.In this paper, we address the above problems by exploiting the information provided by a Document Type Definition (DTD) associated with an XML document. We show that a DTD is able to facilitate better compression as well as generate more usable compressed data to support querying. We present the architecture of the XCQ, which is a compression and querying tool for handling XML data. XCQ is based on a novel technique we have developed called DTD Tree and SAX Event Stream Parsing (DSP). The documents compressed by XCQ are stored in Partitioned Path-Based Grouping (PPG) data streams, which are equipped with a Block Statistics Signature (BSS) indexing scheme. The indexed PPG data streams support the processing of XML queries that involve selection and aggregation, without the need for full decompression. In order to study the compression performance of XCQ, we carry out comprehensive experiments over a set of XML benchmark datasets.