A Data Parallel Algorithm for XML DOM Parsing

  • Authors:
  • Bhavik Shah; Praveen R. Rao; Bongki Moon; Mohan Rajagopalan

  • Affiliations:
  • University of Missouri-Kansas City; University of Missouri-Kansas City; University of Arizona; Intel Research Labs

  • Venue:
  • XSym '09 Proceedings of the 6th International XML Database Symposium on Database and XML Technologies
  • Year:
  • 2009

Abstract

The Extensible Markup Language (XML) has become the de facto standard for information representation and interchange on the Internet. XML parsing is a core operation that must be performed on an XML document before it can be accessed and manipulated. This operation is known to cause performance bottlenecks in applications and systems that process large volumes of XML data. We believe that parallelism is a natural way to boost performance. Leveraging multicore processors can offer a cost-effective solution, because future multicore processors will support hundreds of cores and will offer a high degree of parallelism in hardware. We propose a data parallel algorithm called ParDOM for XML DOM parsing, which builds an in-memory tree structure for an XML document. ParDOM has two phases. In the first phase, an XML document is partitioned into chunks, and the chunks are parsed in parallel. In the second phase, the partial DOM node tree structures created during the first phase are linked together (in parallel) to build a complete DOM node tree. ParDOM offers fine-grained parallelism by adopting a flexible chunking scheme: each chunk can contain an arbitrary number of start and end XML tags that are not necessarily matched. ParDOM can be conveniently implemented using a data parallel programming model that supports map and sort operations. Through empirical evaluation, we show that ParDOM yields better scalability than PXP [23], a recently proposed parallel DOM parsing algorithm, on commodity multicore processors. Furthermore, ParDOM can process a wide variety of XML datasets with complex structures that PXP fails to parse.
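
The two-phase idea described above can be illustrated with a small Python sketch. It is not the authors' ParDOM implementation, and every name in it (`Node`, `tokenize`, `parse_chunk`, `link`, `par_dom_sketch`, `to_tuple`) is hypothetical. It makes several simplifying assumptions: the input is well-formed, only element tags are kept (text, attributes, comments, and tag-name checks are ignored), chunks are cut on tag boundaries rather than at arbitrary byte offsets, and the linking phase runs sequentially instead of in parallel with a sort step. Each chunk may hold unmatched start and end tags; phase 1 records them so that phase 2 can resolve them.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Matches <name ...>, </name>, and <name .../>. Assumes no '<' or '>' inside attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


class Node:
    """A bare-bones element node: a tag name plus ordered children."""

    def __init__(self, name):
        self.name = name
        self.children = []


def tokenize(xml):
    """Turn the document into a flat list of ("start" | "end", name) tokens."""
    tokens = []
    for match in TAG.finditer(xml):
        closing, name, self_closing = match.groups()
        if closing:
            tokens.append(("end", name))
        else:
            tokens.append(("start", name))
            if self_closing:
                tokens.append(("end", name))
    return tokens


def parse_chunk(tokens):
    """Phase 1 (map step): build partial trees for one chunk in isolation.

    Returns (events, still_open):
      events     - in document order, ("orphan", node) for a node whose parent
                   lies in an earlier chunk, or ("end", name) for an end tag
                   whose start tag lies in an earlier chunk
      still_open - nodes opened in this chunk and not yet closed
    """
    events, stack = [], []
    for kind, name in tokens:
        if kind == "start":
            node = Node(name)
            if stack:
                stack[-1].children.append(node)
            else:
                events.append(("orphan", node))
            stack.append(node)
        elif stack:
            stack.pop()
        else:
            events.append(("end", name))
    return events, stack


def link(partials):
    """Phase 2: stitch the partial trees together (sequential in this sketch)."""
    root, open_nodes = None, []
    for events, still_open in partials:
        for kind, item in events:
            if kind == "orphan":
                if open_nodes:
                    open_nodes[-1].children.append(item)
                else:
                    root = item
            else:
                open_nodes.pop()  # closes an element opened in an earlier chunk
        open_nodes.extend(still_open)
    return root


def par_dom_sketch(xml, chunk_size=4, workers=4):
    """Chunk the token stream, parse the chunks concurrently, then link."""
    tokens = tokenize(xml)
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(parse_chunk, chunks))
    return link(partials)


def to_tuple(node):
    """Convert a tree to nested tuples for easy comparison."""
    return (node.name, [to_tuple(child) for child in node.children])
```

Calling `to_tuple(par_dom_sketch("<a><b><c/></b><d><e></e></d></a>", chunk_size=3))` should give the same tree for any chunk size, because chunk boundaries only affect where unmatched tags are recorded, not how they are eventually resolved. A thread pool is used here only to keep the sketch short; in CPython it will not deliver real speedup for this CPU-bound work.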