XML-document-filtering automaton

  • Authors: Panu Silvasti; Seppo Sippu; Eljas Soisalon-Soininen

  • Affiliations: Helsinki University of Technology; -; -

  • Venue: Proceedings of the VLDB Endowment
  • Year: 2008

Abstract

In a publish-subscribe system based on the filtering of XML documents, subscribers specify their interests with profiles expressed in the XPath language. The system processes a stream of XML documents and delivers to each subscriber either a notification or the content of the documents that match the subscriber's profiles. We present a new XML-document-filtering algorithm based on the classic Aho-Corasick pattern-matching automaton. The size of the automaton is linear in the sum of the sizes of the filters. We assume that all XML documents conform to a given DTD; in the preprocessing phase of the automaton, our algorithm uses the DTD to prune out descendant axes (//) and wildcards (*) from the XPath filters. The XPath subset currently supported consists of linear XPath expressions without predicates. On a 683 MB protein-sequence database, we obtained a throughput of 18.8 MB/sec for 50 000 filters and 17.0 MB/sec for 500 000 filters, using a SAX parser whose own throughput is 27 MB/sec.
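
The general idea of running an Aho-Corasick automaton over element paths can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes every filter has already been reduced to the form //e1/e2/.../ek (a leading descendant axis followed only by child steps), it omits the DTD-based pruning entirely, and the names PathFilterAutomaton and match_document are hypothetical. For brevity, output sets are merged along failure links, which can use more memory than a strictly linear-size automaton.

```python
import io
import xml.etree.ElementTree as ET
from collections import deque


class PathFilterAutomaton:
    """Aho-Corasick automaton whose alphabet is the set of element names.

    Each filter is a list of element names [e1, ..., ek] standing for the
    (assumed, simplified) XPath filter //e1/.../ek.
    """

    def __init__(self, filters):
        self.goto = [{}]      # goto[state][element_name] -> next state
        self.fail = [0]       # failure link of each state
        self.out = [set()]    # ids of filters recognized in each state
        for filter_id, steps in enumerate(filters):
            self._insert(filter_id, steps)
        self._build_failure_links()

    def _insert(self, filter_id, steps):
        state = 0
        for name in steps:
            nxt = self.goto[state].get(name)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][name] = nxt
            state = nxt
        self.out[state].add(filter_id)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep failure link 0.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for name, nxt in self.goto[state].items():
                f = self.fail[state]
                while f and name not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(name, 0)
                # Inherit matches of the longest proper suffix.
                self.out[nxt] |= self.out[self.fail[nxt]]
                queue.append(nxt)

    def step(self, state, name):
        # Follow failure links until a transition on `name` exists.
        while state and name not in self.goto[state]:
            state = self.fail[state]
        return self.goto[state].get(name, 0)


def match_document(automaton, xml_bytes):
    """Return the ids of all filters matched by some element path."""
    matched = set()
    stack = [0]  # automaton state for each open element, root state first
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            state = automaton.step(stack[-1], elem.tag)
            stack.append(state)
            matched |= automaton.out[state]
        else:
            stack.pop()  # leaving the element restores the parent's state
    return matched


# Filters //a/b, //b/c, //c and //a/c, given as element-name sequences.
demo_filters = [["a", "b"], ["b", "c"], ["c"], ["a", "c"]]
demo_automaton = PathFilterAutomaton(demo_filters)
demo_result = match_document(demo_automaton, b"<a><b><c/></b><d/></a>")
print(sorted(demo_result))  # [0, 1, 2]
```

The stack of states plays the role that backtracking plays in a streaming parser: each start tag advances the automaton from the parent's state, and each end tag discards that state, so every root-to-element path is scanned as if it were a separate input string.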