Detecting duplicate objects in XML documents

  • Authors:
  • Melanie Weis; Felix Naumann

  • Affiliations:
  • Humboldt-Universität zu Berlin, Berlin, Germany; Humboldt-Universität zu Berlin, Berlin, Germany

  • Venue:
  • Proceedings of the 2004 International Workshop on Information Quality in Information Systems
  • Year:
  • 2004

Abstract

Detecting and purging duplicate entities that describe the same real-world object is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required.

In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution traverses the XML tree top-down to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure relies on string similarity, which is measured using the edit distance between pairs of strings. To increase efficiency, we avoid computing the edit distance for many string pairs by applying three filtering methods in succession. Initial experiments show that our approach detects XML duplicates accurately and efficiently.
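
The pipeline outlined in the abstract (thresholded edit-distance similarity, a filter that skips hopeless comparisons, and clustering by transitive closure) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0.8 threshold, the normalization by the longer string, and the single length-based filter are all assumptions made for illustration; the abstract does not name the paper's three filters, and the top-down XML traversal is left out so that only plain strings are compared.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Thresholded, length-normalized edit similarity (illustrative)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are treated as identical
    # Length filter (an assumption, not necessarily one of the paper's three):
    # the edit distance is at least the length difference, so the similarity
    # can never exceed 1 - |len(a) - len(b)| / longest.
    if 1 - abs(len(a) - len(b)) / longest < threshold:
        return False
    return 1 - edit_distance(a, b) / longest >= threshold


def cluster_duplicates(values, threshold=0.8):
    """Group values by the transitive closure of `similar` (union-find)."""
    parent = list(range(len(values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if similar(values[i], values[j], threshold):
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, value in enumerate(values):
        clusters.setdefault(find(i), []).append(value)
    return list(clusters.values())


titles = ["The Matrix", "The Matrx", "Matrix, The", "Signs"]
clusters = cluster_duplicates(titles)
```

With these hypothetical titles, the misspelled "The Matrx" is clustered with "The Matrix", while "Signs" is rejected by the length filter alone, without any edit-distance computation. Because clusters are formed by transitive closure, two strings can end up in the same cluster even when their direct similarity is below the threshold, as long as a chain of similar pairs connects them.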