Detecting duplicate entities that describe the same real-world object, and purging them, is an important data cleansing task that is necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms to detect duplicates in nested XML documents are required.

In this paper, we present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution traverses the XML tree top-down to identify duplicate elements on each level. Pairs of duplicate elements are detected using a thresholded similarity function and are then clustered by computing the transitive closure. To minimize the number of pairwise element comparisons, an appropriate filter function is used. The similarity measure involves string similarity for pairs of strings, which is measured by their edit distance. To increase efficiency, we avoid computing the edit distance for many pairs of strings by applying three filtering methods in succession. Initial experiments show that our approach detects XML duplicates accurately and efficiently.
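The pipeline described above (a cheap filter, a thresholded edit-distance comparison, then clustering by transitive closure) can be illustrated with a minimal Python sketch. It is not the paper's implementation: it works on a flat list of strings rather than on XML elements, the names `edit_distance`, `similar`, and `cluster_duplicates` and the parameter `max_dist` are hypothetical, and the single length filter shown is a common lower-bound filter chosen for illustration, not necessarily one of the three filters the paper uses.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, max_dist: int) -> bool:
    """Thresholded comparison with a cheap filter applied first."""
    # Illustrative length filter (an assumption, not taken from the paper):
    # the edit distance can never be smaller than the length difference,
    # so such pairs are rejected without running the dynamic program.
    if abs(len(a) - len(b)) > max_dist:
        return False
    return edit_distance(a, b) <= max_dist


def cluster_duplicates(values, max_dist=2):
    """Group values into duplicate clusters via transitive closure."""
    parent = list(range(len(values)))  # union-find forest over indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Every pair judged similar is merged; the union-find structure
    # yields the transitive closure of the pairwise duplicate relation.
    for i, j in combinations(range(len(values)), 2):
        if similar(values[i], values[j], max_dist):
            parent[find(i)] = find(j)

    clusters = {}
    for i, value in enumerate(values):
        clusters.setdefault(find(i), []).append(value)
    # Only clusters with more than one member are duplicates.
    return [members for members in clusters.values() if len(members) > 1]
```

With `max_dist=1`, "Matrx" and "Matrixx" are rejected by the length filter when compared directly, yet they still end up in the same cluster because both are within distance 1 of "Matrix". This shows how the transitive closure can group elements that the pairwise threshold alone would keep apart.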