Using a relational database for scalable XML search

  • Authors:
  • Rebecca J. Cathey;Steven M. Beitzel;Eric C. Jensen;David Grossman;Ophir Frieder

  • Affiliations:
  • Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology, Chicago, USA 60616;Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology, Chicago, USA 60616;Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology, Chicago, USA 60616;Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology, Chicago, USA 60616;Department of Computer Science, Georgetown University and IIT, Washington, USA 20057

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 2008

Abstract

XML is a flexible and powerful tool that enables information and security sharing in heterogeneous environments. Scalable technologies are needed to manage the growing volumes of XML data effectively. A wide variety of methods exist for storing and searching XML data; the two most common are the conventional tree-based approach and the relational approach. Tree-based approaches represent XML as a tree and use indexes and path join algorithms to process queries. In contrast, the relational approach uses a mature relational database to store and search XML: it maps XML queries to SQL and reconstructs the XML from the database results. To date, acceptance of the relational approach to XML processing has been limited by the need to redesign the relational schema each time a new XML hierarchy is defined. We instead describe a relational approach that uses a fixed schema, eliminating the need for schema redesign at the expense of potentially longer runtimes. We show, however, that these potentially longer runtimes are still significantly shorter than those of the tree-based approach. To compare the scalability of both approaches, we used XBench, a popular XML benchmark, to generate large collections of heterogeneous XML documents ranging in size from 500 MB to 8 GB. We measured the scalability of each method by running XML queries that cover a wide range of XML search features on each collection as the collection size increases. In addition, we examine the performance of each method as the result size and the number of predicates increase. Our results show that our relational approach provides scalable XML retrieval by leveraging existing relational database optimizations. Furthermore, the relational approach typically outperforms the tree-based approach while scaling consistently over all collections studied.
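
The abstract does not give the fixed schema or the query translation, so the following is only a minimal sketch of the general idea, not the authors' design. It assumes a single hypothetical `node` table keyed by root-to-node path, uses SQLite in place of a production relational database, and hand-translates one XPath-style query into SQL. The table layout, the `shred` and `titles_by_author` helpers, and the sample documents are all illustrative assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical fixed schema: the same table serves any XML hierarchy,
# so no schema redesign is needed when a new document structure appears.
SCHEMA = """
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    doc_id    INTEGER NOT NULL,
    parent_id INTEGER,
    path      TEXT NOT NULL,
    value     TEXT
);
CREATE INDEX node_path ON node(path);
"""


def shred(conn, doc_id, xml_text):
    """Store each element of one document as a row keyed by its root-to-node path."""
    def walk(elem, parent_id, parent_path):
        path = f"{parent_path}/{elem.tag}"
        text = (elem.text or "").strip() or None
        cur = conn.execute(
            "INSERT INTO node (doc_id, parent_id, path, value) VALUES (?, ?, ?, ?)",
            (doc_id, parent_id, path, text),
        )
        for child in elem:
            walk(child, cur.lastrowid, path)

    walk(ET.fromstring(xml_text), None, "")


def titles_by_author(conn, author):
    """SQL counterpart of the query /catalog/book[author=$author]/title."""
    rows = conn.execute(
        """
        SELECT t.value
        FROM node AS a
        JOIN node AS t ON t.parent_id = a.parent_id
        WHERE a.path = '/catalog/book/author' AND a.value = ?
          AND t.path = '/catalog/book/title'
        ORDER BY t.id
        """,
        (author,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
shred(conn, 1, "<catalog><book><title>A</title><author>Smith</author></book>"
               "<book><title>B</title><author>Jones</author></book></catalog>")
shred(conn, 2, "<catalog><book><title>C</title><author>Smith</author></book></catalog>")
print(titles_by_author(conn, "Smith"))  # ['A', 'C']
```

In this sketch the predicate on `author` becomes a self-join on the shared parent row, which lets the database's own index and join optimizations do the work. Each additional predicate would add another self-join in this layout.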