Unsupervised duplicate detection using sample non-duplicates

Authors: Patrick Lehti; Peter Fankhauser
Affiliations: Fraunhofer IPSI, Darmstadt, Germany; Fraunhofer IPSI, Darmstadt, Germany
Venue: Journal on Data Semantics VII
Year: 2005

Abstract

The problem of identifying objects in databases that refer to the same real-world entity is known, among other names, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, because of errors and missing data. Typical current methods require either a deep understanding of the application domain or a good representative training set, both of which entail significant costs. In this paper we present an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates, then analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. Additionally, the presented approach is not limited to aligning flat records: it also makes use of related objects, which may significantly increase alignment accuracy. Evaluations show that our approach outperforms other unsupervised approaches and reaches almost the same accuracy as fully supervised, domain-dependent approaches.
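
The idea of comparing similarity values among potential duplicates with those among sample non-duplicates can be illustrated with a small sketch. This is not the method of the paper, whose details are not given in the abstract. It is a deliberately simplified stand-in that assumes a single string similarity (Python's difflib ratio), a caller-supplied list of known non-duplicate pairs, and a hypothetical quantile cut-off. The function names and the quantile parameter are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def threshold_from_non_duplicates(scores, quantile=0.95):
    """Return the similarity value at the given quantile of the
    non-duplicate scores (simple index-based quantile, an assumption)."""
    if not scores:
        raise ValueError("need at least one non-duplicate score")
    ordered = sorted(scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def refine_candidates(candidates, non_duplicates, quantile=0.95):
    """Keep only candidate pairs that are more similar than almost all
    sample non-duplicate pairs."""
    non_dup_scores = [similarity(a, b) for a, b in non_duplicates]
    threshold = threshold_from_non_duplicates(non_dup_scores, quantile)
    return [(a, b) for a, b in candidates if similarity(a, b) > threshold]


if __name__ == "__main__":
    # Broad initial alignment: every pair here is only a potential duplicate.
    candidates = [
        ("Patrick Lehti", "Patrik Lehti"),
        ("Patrick Lehti", "Peter Fankhauser"),
    ]
    # Pairs assumed to be known non-duplicates (hypothetical sample).
    non_duplicates = [
        ("Darmstadt", "Berlin"),
        ("record linkage", "image segmentation"),
        ("alpha", "omega"),
    ]
    print(refine_candidates(candidates, non_duplicates))
```

A quantile of the non-duplicate scores is used here only because it needs no labelled duplicates. A fuller treatment would model both similarity distributions and combine several attributes, including those of related objects.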