Duplicate detection through structure optimization

Authors:
Luís Leitão;Pável Calado
Affiliations:
IST/INESC-ID, Porto Salvo, Portugal;IST/INESC-ID, Porto Salvo, Portugal
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Citing 28
Cited 1

Probabilistic reasoning in intelligent systems: networks of plausible inference

Probabilistic reasoning in intelligent systems: networks of plausible inference
The nature of statistical learning theory

The nature of statistical learning theory
The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Making large-scale support vector machine learning practical

Advances in kernel methods
Machine Learning

Machine Learning
Modern Information Retrieval

Modern Information Retrieval
Entity Identification in Database Integration

Proceedings of the Ninth International Conference on Data Engineering
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding similar identities among objects from multiple web sources

WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Robust Identification of Fuzzy Duplicates

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Reference reconciliation in complex information spaces

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
DogmatiX tracks down duplicates in XML

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Exploiting relationships for object consolidation

Proceedings of the 2nd international workshop on Information quality in information systems
Relational clustering for multi-type entity resolution

MRDM '05 Proceedings of the 4th international workshop on Multi-relational mining
Profile-Based Object Matching for Information Integration

IEEE Intelligent Systems
Domain-independent data cleaning via analysis of entity-relationship graph

ACM Transactions on Database Systems (TODS)
Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Eliminating fuzzy duplicates in data warehouses

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Structure-based inference of xml similarity for fuzzy duplicate detection

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Matching XML documents in highly dynamic applications

Proceedings of the eighth ACM symposium on Document engineering
Accurate Synthetic Generation of Realistic Personal Information

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Entity resolution with iterative blocking

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
A system for detecting xml similarity in content and structure using relational database

Proceedings of the 18th ACM conference on Information and knowledge management
Efficient Techniques for Online Record Linkage

IEEE Transactions on Knowledge and Data Engineering
Object identification with attribute-mediated dependences

PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases
XML duplicate detection using sorted neighborhoods

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology

An automatic blocking strategy for XML duplicate detection

ACM SIGAPP Applied Computing Review

Quantified Score

Hi-index	0.00

Visualization

Abstract

Detecting and eliminating duplicates in databases is a task of critical importance in many applications. Although solutions for traditional models, such as relational data, have been widely studied, recently there has been some focus on solutions for more complex hierarchical structures as, for instance, XML data. Such data presents many different challenges, among which is the issue of how to exploit the schema structure to determine if two objects are duplicates. In this paper, we argue that structure can indeed have a significant impact on the process of duplicate detection. We propose a novel method that automatically restructures database objects in order to take full advantage of the relations between its attributes. This new structure reflects the relative importance of the attributes in the database and avoids the need to perform a manual selection. To test our approach we applied it to an existing duplicate detection system. Experiments performed on several datasets show that, using the new learned structure, we consistently outperform both the results obtained with the original database structure and those obtained by letting a knowledgeable user manually choose the attributes to compare.