Efficient string matching: an aid to bibliographic search
Communications of the ACM
Optimizing Regular Path Expressions Using Graph Schemas
ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
Adding Structure to Unstructured Data
ICDT '97 Proceedings of the 6th International Conference on Database Theory
Efficient Filtering of XML Documents for Selective Dissemination of Information
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Stream processing of XPath queries with predicates
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Light-weight xPath processing of XML stream with deterministic automata
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Path sharing and predicate evaluation for high-performance XML filtering
ACM Transactions on Database Systems (TODS)
Processing XML streams with deterministic automata and stream indexes
ACM Transactions on Database Systems (TODS)
SFilter: Schema based Filtering System for XML Streams
MUE '07 Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering
XML-document-filtering automaton
Proceedings of the VLDB Endowment
Dissemination of heterogeneous XML data in publish/subscribe systems
Proceedings of the 18th ACM conference on Information and knowledge management
Energy and Latency Efficient Access of Wireless XML Stream
Journal of Database Management
A survey on XML streaming evaluation techniques
The VLDB Journal — The International Journal on Very Large Data Bases
In a publish-subscribe system based on filtering of XML documents, subscribers specify their interests with profiles expressed in the XPath language. The system processes a stream of XML documents and delivers to each subscriber a notification, or the content, of the documents that match that subscriber's profiles. For filtering with profiles expressed as linear XPath queries, automaton-based approaches exist in which the intractable size growth of a preconstructed deterministic finite automaton is avoided by using a nondeterministic automaton. In this article we examine how these general approaches, which do not assume the existence of any specific schema or document type definition (DTD), might benefit from the knowledge that all the XML documents to be filtered obey a given DTD. We present an algorithm that uses the DTD in the preprocessing phase of the filtering automaton to prune descendant operators (//) and wildcards (*) from the linear XPath filters. Experiments with data obtained from the XML Data Repository of the University of Washington indicate that filter pruning can increase the throughput of the nondeterministic YFilter automaton of Diao et al. by a factor of 2 to 20. We also present a new filtering algorithm based on a backtracking deterministic finite automaton derived from the classic Aho-Corasick pattern-matching automaton. The size of this automaton is linear in the sum of the sizes of the filters. With our algorithm, we obtained a throughput of 15 MB/sec for filters pruned from one million original filters (with all wildcards and non-leading descendant operators eliminated), an improvement by a factor of 2 to 3 over the throughput of YFilter.
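To make the idea of DTD-based pruning concrete, the following is a minimal sketch and not the algorithm of the article. It assumes a non-recursive DTD, reduced to a hypothetical `dtd` dictionary that maps each element name to the set of its allowed child elements, and it rewrites a linear XPath filter into the equivalent set of filters that use only the child axis and concrete element names. The function names `parse_filter`, `extensions`, and `prune`, and the sample DTD, are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Step = Tuple[str, str]   # (axis, name): axis is "/" or "//", name is an element or "*"
Path = Tuple[str, ...]   # element names from the document root downwards


def parse_filter(xpath: str) -> List[Step]:
    """Split a linear XPath filter such as '/a//b/*' into (axis, name) steps."""
    steps, i = [], 0
    while i < len(xpath):
        if xpath.startswith("//", i):
            axis, i = "//", i + 2
        elif xpath[i] == "/":
            axis, i = "/", i + 1
        else:
            raise ValueError("filter must start each step with / or //")
        j = i
        while j < len(xpath) and xpath[j] != "/":
            j += 1
        if j == i:
            raise ValueError("empty step name")
        steps.append((axis, xpath[i:j]))
        i = j
    return steps


def children(path: Path, dtd: Dict[str, Set[str]], root: str) -> List[str]:
    """Allowed children of the last element; the empty path has the root as its only child."""
    return [root] if not path else sorted(dtd.get(path[-1], ()))


def extensions(path: Path, dtd: Dict[str, Set[str]], root: str) -> List[Path]:
    """All DTD paths that extend `path` by zero or more child steps."""
    out, stack = [path], [path]
    while stack:
        p = stack.pop()
        for child in children(p, dtd, root):
            if child in p:  # assumption of this sketch: the DTD is non-recursive
                raise ValueError("recursive DTD not supported in this sketch")
            out.append(p + (child,))
            stack.append(p + (child,))
    return out


def prune(steps: List[Step], dtd: Dict[str, Set[str]], root: str) -> List[str]:
    """Rewrite a filter into child-axis-only filters without wildcards."""
    paths: List[Path] = [()]
    for axis, name in steps:
        next_paths = []
        for path in paths:
            starts = extensions(path, dtd, root) if axis == "//" else [path]
            for p in starts:
                for child in children(p, dtd, root):
                    if name in ("*", child):
                        next_paths.append(p + (child,))
        paths = next_paths
    return sorted({"/" + "/".join(p) for p in paths})


# Hypothetical DTD used only for illustration.
dtd = {
    "bib": {"book", "article"},
    "book": {"title", "author"},
    "article": {"title", "author"},
    "author": {"name"},
}

print(prune(parse_filter("//author/*"), dtd, "bib"))
# ['/bib/article/author/name', '/bib/book/author/name']
```

In this sketch a filter that the DTD can never satisfy, such as `//name/title`, prunes to an empty list and could be dropped before the filtering automaton is built. A pruned filter may also expand into several concrete filters, so the total filter size can grow. Recursive DTDs would need a different treatment from the exhaustive path enumeration used here.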