Parallel XML Parsing Using Meta-DFAs

Authors:
Yinfei Pan;Ying Zhang;Kenneth Chiu;Wei Lu
Venue:
E-SCIENCE '07 Proceedings of the Third IEEE International Conference on e-Science and Grid Computing
Year:
2007

Abstract

By leveraging the growing prevalence of multicore CPUs, parallel XML parsing (PXP) can significantly improve the performance of XML processing, making XML more suitable for scientific data, which is often dominated by floating-point numbers. One approach is to divide the XML document into equal-sized chunks and parse each chunk in parallel. XML parsing is inherently sequential, however, because the state of an XML parser when reading a given character potentially depends on all preceding characters. In previous work, we addressed this by using a fast preparsing scan to build an outline of the document, which we called the skeleton. The skeleton is then used to guide the parallel full parse. The preparse is itself a sequential phase that limits scalability, however, so in this paper we show how the preparse can be parallelized using a mechanism we call a meta-DFA. For each state q of the original preparser, the meta-DFA incorporates a complete copy of the preparser state machine as a sub-DFA that starts in state q. The meta-DFA thus runs multiple instances of the preparser simultaneously when parsing a chunk, with one instance for each possible preparser state at the beginning of the chunk. By pursuing all possibilities simultaneously, the meta-DFA allows each chunk to be preparsed independently and in parallel. The parallel full parse that follows the preparse is performed using libxml2 and outputs DOM trees that are fully compatible with existing applications that use libxml2. Our implementation scales well on a 30-CPU Sun E6500 machine.
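
The meta-DFA idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-state toy DFA (TEXT, TAG, QUOTE), the function names, and the use of a thread pool are hypothetical and are not the paper's preparser or implementation. The sketch also simulates the sub-DFAs side by side at run time, whereas a real meta-DFA would presumably be constructed ahead of time as a single automaton. For each chunk it computes a map from every possible start state to the resulting end state, then chains those maps in document order to recover the state a sequential scan would have reached.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical three-state toy preparser, NOT the paper's state machine.
STATES = ("TEXT", "TAG", "QUOTE")


def step(state, ch):
    """One transition of the toy DFA."""
    if state == "TEXT":
        return "TAG" if ch == "<" else "TEXT"
    if state == "TAG":
        if ch == ">":
            return "TEXT"
        return "QUOTE" if ch == '"' else "TAG"
    # state == "QUOTE": only a closing quote leaves an attribute value
    return "TAG" if ch == '"' else "QUOTE"


def meta_run(chunk):
    """Run one DFA instance per possible start state over the chunk.

    Returns a dict mapping each assumed start state to its end state.
    """
    current = {q: q for q in STATES}
    for ch in chunk:
        for q in STATES:
            current[q] = step(current[q], ch)
    return current


def split_chunks(text, n):
    """Split text into at most n roughly equal-sized chunks."""
    size = max(1, -(-len(text) // n))  # ceiling division, never zero
    return [text[i:i + size] for i in range(0, len(text), size)]


def parallel_final_state(text, n_chunks, start="TEXT"):
    """Scan chunks independently, then chain the per-chunk state maps."""
    chunks = split_chunks(text, n_chunks)
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(meta_run, chunks))
    state = start
    for m in maps:
        state = m[state]  # pick the instance whose start state was correct
    return state


def sequential_final_state(text, start="TEXT"):
    """Reference implementation: an ordinary left-to-right scan."""
    state = start
    for ch in text:
        state = step(state, ch)
    return state


if __name__ == "__main__":
    doc = '<a href="x>y"><b>text</b></a>'
    print(parallel_final_state(doc, 4), sequential_final_state(doc))
```

Only the cheap chaining step in `parallel_final_state` is sequential; each `meta_run` call needs no information from earlier chunks. The thread pool here shows the structure only, since CPython threads do not run this kind of CPU-bound loop in parallel.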