Ingredients for accurate, fast, and robust XML similarity joins

Authors:
Leonardo Andrade Ribeiro; Theo Härder
Affiliations:
Department of Computer Science, Federal University of Lavras, Brazil; AG DBIS, Department of Computer Science, University of Kaiserslautern, Germany
Venue:
DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part II
Year:
2011

Abstract

We consider the problem of answering similarity join queries on large, non-schematic, heterogeneous XML datasets. Realizing similarity joins on such datasets is challenging because the semi-structured nature of XML substantially increases the complexity of the underlying similarity function in terms of both effectiveness and efficiency. Moreover, even selecting the pieces of information to use for similarity assessment is complicated, because they can appear in different parts of the documents in a dataset. In this paper, we present an approach that jointly calculates textual and structural similarity of XML trees while implicitly embedding similarity selection into join processing. We validate the accuracy, performance, and scalability of our techniques with a set of experiments in the context of an XML DBMS.