From information to knowledge: harvesting entities and relationships from web sources
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Entity Resolution and Information Quality
Entity Resolution and Information Quality
SemGen: towards a semantic data generator for benchmarking duplicate detectors
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Learning phenotype mapping for integrating large genetic data
BioNLP '11 Proceedings of BioNLP 2011 Workshop
A supervised machine learning approach for duplicate detection over gazetteer records
GeoS'11 Proceedings of the 4th international conference on GeoSpatial semantics
Query relaxation for entity-relationship search
ESWC'11 Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web: Research and Applications - Volume Part II
Detecting bug duplicate reports through local references
Proceedings of the 7th International Conference on Predictive Models in Software Engineering
Efficient duplicate detection on cloud using a new signature scheme
WAIM'11 Proceedings of the 12th international conference on Web-age information management
Efficient similarity search: arbitrary similarity measures, arbitrary composition
Proceedings of the 20th ACM international conference on Information and knowledge management
Black swan: augmenting statistics with event data
Proceedings of the 20th ACM international conference on Information and knowledge management
Instance-based 'one-to-some' assignment of similarity measures to attributes
OTM'11 Proceedings of the 2011 Confederated International Conference on On the Move to Meaningful Internet Systems - Volume Part I
The impact of spelling errors on patent search
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
An effective rule miner for instance matching in a web of data
Proceedings of the 21st ACM international conference on Information and knowledge management
Evaluating indeterministic duplicate detection results
SUM'12 Proceedings of the 6th international conference on Scalable Uncertainty Management
Indeterministic Handling of Uncertain Decisions in Deduplication
Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
Cost-aware query planning for similarity search
Information Systems
Knowledge harvesting in the big-data era
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Determining the relative accuracy of attributes
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
MFIBlocks: An effective blocking algorithm for entity resolution
Information Systems
A taxonomy of privacy-preserving record linkage techniques
Information Systems
Letting keys and functional dependencies out of the bag
APCCM '13 Proceedings of the Ninth Asia-Pacific Conference on Conceptual Modelling - Volume 143
Discovering linkage points over web data
Proceedings of the VLDB Endowment
Publishing bibliographic data on the Semantic Web using BibBase
Semantic Web - Linked Data for science and education
With the ever-increasing volume of data, data quality problems abound. Among the most intriguing are duplicates: multiple, yet differing, representations of the same real-world object. The effects of such duplicates are detrimental; for instance, bank customers can obtain duplicate identities, inventory levels are monitored incorrectly, and catalogs are mailed multiple times to the same household. Automatically detecting duplicates is difficult for two reasons. First, duplicate representations are usually not identical but differ slightly in their values. Second, in principle all pairs of records must be compared, which is infeasible for large volumes of data. This lecture closely examines the two main components that address these difficulties: (i) similarity measures, used to automatically identify duplicates when comparing two records; well-chosen similarity measures improve the effectiveness of duplicate detection. (ii) Algorithms designed to search for duplicates in very large volumes of data; well-designed algorithms improve the efficiency of duplicate detection. Finally, we discuss methods to evaluate the success of duplicate detection.

Table of Contents: Data Cleansing: Introduction and Motivation / Problem Definition / Similarity Functions / Duplicate Detection Algorithms / Evaluating Detection Success / Conclusion and Outlook / Bibliography
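To make the two components concrete, here is a minimal sketch, not taken from the lecture itself: it pairs one possible similarity measure (token-based Jaccard similarity, one of many choices) with the naive all-pairs comparison algorithm. The records, the threshold of 0.5, and the function names are illustrative assumptions; the quadratic cost of the inner loop is exactly what motivates the more efficient algorithms (e.g., blocking or windowing) the lecture covers.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def find_duplicates(records, threshold=0.5):
    """Naive all-pairs comparison: O(n^2) record comparisons, which is
    infeasible at scale -- hence blocking/windowing algorithms."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical records: the first two represent the same person.
records = [
    "John Smith 123 Main St Springfield",
    "Jon Smith 123 Main Street Springfield",
    "Alice Jones 9 Elm Rd Shelbyville",
]
print(find_duplicates(records))  # -> [(0, 1)]
```

Swapping in a different similarity function (edit distance, q-grams, etc.) or a smarter candidate-generation strategy changes effectiveness and efficiency independently, which is why the lecture treats the two components separately.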