Efficient and Effective Duplicate Detection in Hierarchical Data

Authors: Luis Leitao; Pavel Calado; Melanie Herschel
Affiliations: IST/INESC-ID, Lisbon; IST/INESC-ID, Lisbon; University of Tübingen, Tübingen
Venue: IEEE Transactions on Knowledge and Data Engineering
Year: 2013

Citing: 0
Cited by: 3

Efficient XML duplicate detection using an adaptive two-level optimization
Proceedings of the 28th Annual ACM Symposium on Applied Computing

An automatic blocking strategy for XML duplicate detection
ACM SIGAPP Applied Computing Review

SBBS: A sliding blocking algorithm with backtracking sub-blocks for duplicate data detection
Expert Systems with Applications: An International Journal

Abstract

Although there is a long line of work on identifying duplicates in relational data, only a few solutions focus on duplicate detection in more complex hierarchical structures, such as XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, we present a novel pruning strategy that achieves significant gains over the unoptimized version of the algorithm. Through experiments, we show that our algorithm achieves high precision and recall scores on several data sets. XMLDup also outperforms another state-of-the-art duplicate detection solution in terms of both efficiency and effectiveness.
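
The idea of scoring two XML elements by both their values and their structure, and of abandoning a comparison early, can be illustrated with a small sketch. This is not the XMLDup Bayesian network or its pruning rule, which the abstract does not specify. It is a simplified stand-in under stated assumptions: leaf values are compared with a generic string similarity, child scores are combined by a plain product, and a comparison is cut off once the running product falls below a threshold (valid here because every remaining factor is at most 1). The names duplicate_score, text_similarity, and MISSING_PRIOR are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Assumed score for a child tag that appears in only one of the two elements.
MISSING_PRIOR = 0.5


def text_similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a stand-in for a value probability."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def duplicate_score(x: ET.Element, y: ET.Element, threshold: float = 0.0) -> float:
    """Score in [0, 1] that x and y describe the same object.

    Returns 0.0 as soon as the score is known to be below `threshold`.
    """
    if x.tag != y.tag:
        return 0.0

    # Leaf elements: compare their text values directly.
    if len(x) == 0 and len(y) == 0:
        score = text_similarity(x.text or "", y.text or "")
        return score if score >= threshold else 0.0

    # Inner elements: combine per-tag child scores with a product.
    score = 1.0
    for tag in sorted({c.tag for c in x} | {c.tag for c in y}):
        xs, ys = x.findall(tag), y.findall(tag)
        if xs and ys:
            # Best match among same-tag children on both sides.
            best = max(duplicate_score(cx, cy) for cx in xs for cy in ys)
        else:
            best = MISSING_PRIOR
        score *= best
        # Pruning: remaining factors are <= 1, so the product cannot recover.
        if score < threshold:
            return 0.0
    return score


if __name__ == "__main__":
    a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
    b = ET.fromstring("<movie><title>the matrix </title><year>1999</year></movie>")
    c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")
    print(duplicate_score(a, b))
    print(duplicate_score(a, c, threshold=0.9))
```

In this sketch, a higher threshold prunes more pairs earlier at the cost of discarding borderline candidates; the product combination is only one possible choice and is used here because it makes the early cutoff easy to justify.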