We consider the problem of answering similarity join queries on large, non-schematic, heterogeneous XML datasets. Realizing similarity joins on such datasets is challenging because the semi-structured nature of XML makes the underlying similarity function substantially more complex, in terms of both effectiveness and efficiency. Moreover, even selecting the pieces of information to use for similarity assessment is complicated, because they can appear in different parts of different documents in a dataset. In this paper, we present an approach that jointly calculates the textual and structural similarity of XML trees while implicitly embedding similarity selection into join processing. We validate the accuracy, performance, and scalability of our techniques with a set of experiments in the context of an XML DBMS.
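
The abstract does not spell out the similarity function, so the following is only an illustrative sketch of what a joint textual and structural comparison could look like, not the method of the paper. It assumes each document is reduced to a set of (root-to-node path, text token) pairs, so that a token counts as shared only when it occurs under the same path, and it assumes Jaccard similarity with a naive all-pairs join. The function names `path_tokens`, `jaccard`, and `similarity_join` are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def path_tokens(xml_text):
    """Return the set of (root-to-node path, lowercased token) pairs.

    Assumption: only element text is used; attributes and tail text
    are ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    features = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        for token in (node.text or "").lower().split():
            features.add((path, token))
        for child in node:
            walk(child, path)

    walk(root, "")
    return features


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(docs, threshold):
    """Naive self-join: index pairs (i, j), i < j, with similarity >= threshold."""
    feats = [path_tokens(d) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(feats[i], feats[j]) >= threshold
    ]
```

Because every feature pairs a path with a token, no separate step is needed to pick which elements to compare, which loosely mirrors the idea of folding similarity selection into the join. A realistic system would replace the quadratic all-pairs loop with an index or a filtering technique.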