2LP: A double-lazy XML parser

  • Authors:
  • Fernando Farfán;Vagelis Hristidis;Raju Rangaswami

  • Affiliations:
  • School of Computing and Information Sciences, Florida International University, 11200 SW 8th Street, Miami, FL 33199, United States;School of Computing and Information Sciences, Florida International University, 11200 SW 8th Street, Miami, FL 33199, United States;School of Computing and Information Sciences, Florida International University, 11200 SW 8th Street, Miami, FL 33199, United States

  • Venue:
  • Information Systems
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

XML is acknowledged as the most effective format for data encoding and exchange over domains ranging from the World Wide Web to desktop applications. However, large-scale adoption into actual system implementations is being slowed down due to the inefficiency of its document-parsing methods. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have a key drawback-they must load the entire XML document in order to extract the overall document structure before document parsing can be performed. We have developed a framework for efficient parsing based on the idea of placing internal physical pointers within the XML document that allow the navigation process to skip large portions of the document during parsing. We show how to generate such internal pointers in a way that optimizes parsing using constructs supported by the current W3C XML standard. A double-lazy parser (2LP) exploits these internal pointers to efficiently parse the document. The usage of supported W3C constructs to create internal pointers allows 2LP to be backward compatible-i.e., the pointer-augmented documents can be parsed by current XML parsers. We also implemented a mechanism to efficiently parse large documents with limited main memory, thereby overcoming a major limitation in current solutions. We study our pointer generation and parsing algorithms both theoretically and experimentally, and show that they perform considerably better than existing approaches.