Schema-conscious filtering of XML documents

Authors:
Panu Silvasti;Seppo Sippu;Eljas Soisalon-Soininen
Affiliations:
Helsinki University of Technology;University of Helsinki;Helsinki University of Technology
Venue:
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Year:
2009

References (10):

Efficient string matching: an aid to bibliographic search
Communications of the ACM

Optimizing Regular Path Expressions Using Graph Schemas
ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering

Adding Structure to Unstructured Data
ICDT '97 Proceedings of the 6th International Conference on Database Theory

Efficient Filtering of XML Documents for Selective Dissemination of Information
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

Stream processing of XPath queries with predicates
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Light-weight xPath processing of XML stream with deterministic automata
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management

Path sharing and predicate evaluation for high-performance XML filtering
ACM Transactions on Database Systems (TODS)

Processing XML streams with deterministic automata and stream indexes
ACM Transactions on Database Systems (TODS)

SFilter: Schema based Filtering System for XML Streams
MUE '07 Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering

XML-document-filtering automaton
Proceedings of the VLDB Endowment

Cited by (3):

Dissemination of heterogeneous XML data in publish/subscribe systems
Proceedings of the 18th ACM conference on Information and knowledge management

Energy and Latency Efficient Access of Wireless XML Stream
Journal of Database Management

A survey on XML streaming evaluation techniques
The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

In a publish-subscribe system based on filtering of XML documents, subscribers specify their interests with profiles expressed in the XPath language. The system processes a stream of XML documents and delivers to subscribers a notification, or the content, of the documents that match their profiles. For filtering with profiles expressed as linear XPath queries, automaton-based approaches exist in which the intractable size growth of a preconstructed deterministic finite automaton is avoided by using a nondeterministic automaton. In this article we examine how these general approaches, which do not assume the existence of any specific schema or document type definition (DTD), might benefit from the knowledge that all the XML documents to be filtered obey a given DTD. We present an algorithm that utilizes the DTD in the preprocessing phase of the filtering automaton to prune out descendant operators (//) and wildcards (*) from the linear XPath filters. Experiments with data obtained from the XML Data Repository of the University of Washington indicate that filter pruning can increase the throughput of the nondeterministic YFilter automaton of Diao et al. by a factor of 2 to 20. We also present a new filtering algorithm based on a backtracking deterministic finite automaton derived from the classic Aho-Corasick pattern-matching automaton. This automaton has a size linear in the sum of the sizes of the filters. For our algorithm, we obtained a throughput of 15 MB/sec for filters pruned from one million original filters (with all wildcards and non-leading descendant operators eliminated), an improvement by a factor of 2 to 3 over the throughput of YFilter.
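
The abstract does not spell out the pruning algorithm, so the following Python sketch only illustrates the general idea of using a DTD to eliminate // and * from a linear XPath filter; it is not the authors' algorithm. It assumes a hypothetical, heavily simplified DTD given as a mapping from each element name to the set of child element names it allows, and it assumes the DTD is non-recursive. Unlike the paper, which reports eliminating only non-leading descendant operators, this sketch expands every filter into absolute paths. The names `parse_filter`, `prune`, and the sample `BIB_DTD` are invented for the example.

```python
from typing import Dict, List, Set, Tuple

Step = Tuple[str, str]  # (axis, name): axis is "/" or "//", name is a tag or "*"


def parse_filter(xpath: str) -> List[Step]:
    """Split a linear XPath such as '/a//b/*' into (axis, name) steps."""
    steps: List[Step] = []
    i = 0
    while i < len(xpath):
        if xpath.startswith("//", i):
            axis, i = "//", i + 2
        elif xpath[i] == "/":
            axis, i = "/", i + 1
        else:
            raise ValueError("filter must start each step with / or //")
        j = i
        while j < len(xpath) and xpath[j] != "/":
            j += 1
        if j == i:
            raise ValueError("empty step in filter")
        steps.append((axis, xpath[i:j]))
        i = j
    return steps


def prune(xpath: str, root: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Return the concrete root-to-element paths that the DTD allows for xpath.

    Assumption: 'children' is a simplified, non-recursive DTD (element -> allowed
    child elements). An empty result means the filter can never match.
    """
    steps = parse_filter(xpath)
    results: Set[str] = set()

    def descend(node: str, path: List[str], k: int) -> None:
        # node: last element on the path ("" stands for the document node)
        # k: index of the next filter step still to be matched
        if k == len(steps):
            results.add("/" + "/".join(path))
            return
        axis, name = steps[k]
        allowed = {root} if node == "" else children.get(node, set())
        for child in sorted(allowed):
            if child in path:
                raise ValueError("recursive DTD is outside this sketch")
            if name in ("*", child):
                descend(child, path + [child], k + 1)  # step k matched here
            if axis == "//":
                descend(child, path + [child], k)  # skip child, keep looking

    descend("", [], 0)
    return results


# Hypothetical sample DTD, used only for illustration.
BIB_DTD: Dict[str, Set[str]] = {
    "bib": {"book", "article"},
    "book": {"title", "author"},
    "article": {"title", "author"},
    "author": {"name"},
}

if __name__ == "__main__":
    for f in ("//author/*", "/bib/*/title", "//publisher"):
        print(f, "->", sorted(prune(f, "bib", BIB_DTD)))
```

Under these assumptions, each pruned path contains only child steps with explicit element names, which is the kind of input a deterministic, Aho-Corasick-style matcher can handle directly. A filter that expands to the empty set cannot match any document valid under the DTD and could be dropped.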