Fast answering of XPath query workloads on web collections

  • Authors: Mariano P. Consens; Flavio Rizzolo
  • Affiliations: University of Toronto; University of Toronto
  • Venue: XSym'07: Proceedings of the 5th International Conference on Database and XML Technologies
  • Year: 2007

Abstract

Several web applications (such as those that process RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly large web collections containing hundreds of thousands of XML documents. Even when a task only deals with a single document at a time, developers benefit from understanding how XPath expressions behave across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). The highly variable structure of such web collections poses additional challenges. This paper introduces DescribeX, a framework for describing arbitrarily complex XML summaries of web collections that enables the efficient evaluation of XPath workloads (supporting all the axes and language constructs in XPath). Experiments validate that DescribeX enables existing document-at-a-time XPath tools to scale up to multi-gigabyte XML collections.