Detecting duplicate objects in XML documents

Authors:
Melanie Weis; Felix Naumann
Affiliations:
Humboldt-Universität zu Berlin, Berlin, Germany; Humboldt-Universität zu Berlin, Berlin, Germany
Venue:
Proceedings of the 2004 international workshop on Information quality in information systems
Year:
2004

Abstract

Detecting and purging duplicate entities that describe the same real-world object is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms that detect duplicates in nested XML documents are required.

In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution traverses the XML tree structure top-down to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function and are then clustered by computing the transitive closure. To reduce the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured using their edit distance. To increase efficiency, we avoid computing the edit distance for many pairs of strings by applying three filtering methods in succession. Initial experiments show that our approach detects XML duplicates accurately and efficiently.
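The pipeline outlined in the abstract (thresholded pairwise comparison, a cheap filter before the edit distance, clustering by transitive closure) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It works on a flat list of strings rather than on XML elements, it uses an absolute edit-distance threshold in place of the paper's similarity function, and the single length-difference filter is an assumption chosen for illustration, since the abstract does not name the three filters. All function names and the `max_distance` parameter are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_duplicate_pair(a: str, b: str, max_distance: int) -> bool:
    """Thresholded comparison with one cheap filter (an assumed example)."""
    # The edit distance is never smaller than the length difference,
    # so pairs failing this check can be rejected without running the DP.
    if abs(len(a) - len(b)) > max_distance:
        return False
    return edit_distance(a, b) <= max_distance


def cluster_duplicates(values: list[str], max_distance: int) -> list[list[str]]:
    """Group values by the transitive closure of the duplicate-pair relation."""
    parent = list(range(len(values)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if is_duplicate_pair(values[i], values[j], max_distance):
            parent[find(i)] = find(j)

    clusters: dict[int, list[str]] = {}
    for i, value in enumerate(values):
        clusters.setdefault(find(i), []).append(value)
    return list(clusters.values())
```

With `max_distance=1`, the values `"cat"`, `"cut"`, and `"cup"` fall into one cluster even though `"cat"` and `"cup"` are two edits apart, because the transitive closure links them through `"cut"`. This is the usual trade-off of closure-based clustering: it merges chains of near-matches, which may or may not be desirable for a given data set.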