XML duplicate detection using sorted neighborhoods

Authors:
Sven Puhlmann; Melanie Weis; Felix Naumann
Affiliations:
Humboldt-Universität zu Berlin, Berlin, Germany; Humboldt-Universität zu Berlin, Berlin, Germany; Humboldt-Universität zu Berlin, Berlin, Germany
Venue:
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Year:
2006



Abstract

Detecting duplicates is a problem with a long tradition in many domains, such as customer relationship management and data warehousing. The problem is twofold: first, define a suitable similarity measure; second, apply the measure efficiently to all pairs of objects. With the advent and spread of the XML data model, it is necessary to find new similarity measures and to develop efficient methods for detecting duplicate elements in nested XML data. A classical approach to duplicate detection in flat relational data is the sorted neighborhood method, which draws its efficiency from sliding a window over the sorted relation and comparing only tuples within that window. We extend the algorithm to cover not only a single relation but also nested XML elements. To compare objects, we make use of XML parent and child relationships. For efficiency, we apply the windowing technique in a bottom-up fashion, detecting duplicates at each level of the XML hierarchy. Experiments show a speedup comparable to that of the original method and demonstrate the high effectiveness of our algorithm in detecting XML duplicates.
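
The windowing idea behind the classical sorted neighborhood method for flat data can be illustrated with a minimal Python sketch. It is not the authors' XML algorithm. The function names `similar` and `sorted_neighborhood`, the four-character sort key, the 0.8 threshold, the window size of 3, and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity measure: difflib ratio on lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def sorted_neighborhood(records, key, window=3, is_dup=similar):
    """Return index pairs (i, j) with i < j that are judged duplicates.

    Only records that lie within `window` positions of each other in the
    key-sorted order are compared, instead of all pairs.
    """
    # Sort record indices by the generated key so likely duplicates end up close.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare record i only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if is_dup(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Illustrative records (assumed data, not from the paper).
records = ["Abbey Road", "abbey road ", "Let It Be", "Abbey Rd", "Revolver"]
dups = sorted_neighborhood(records, key=lambda r: r.lower().strip()[:4], window=3)
print(sorted(dups))
```

With n records and window size w, the sliding pass performs on the order of n·w comparisons instead of n² after sorting. In the nested setting the abstract describes, a pass of this kind would be applied level by level from the leaves upward, so that duplicates found among child elements can inform the comparison of their parents; that extension is not shown here.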