A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Authors:
Peter Christen
Affiliations:
The Australian National University, Canberra
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2012

Citing 0
Cited 28

Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data
Proceedings of the fifth ACM international conference on Web search and data mining

Fake injection strategies for private phonetic matching
DPM'11 Proceedings of the 6th international conference, and 4th international conference on Data Privacy Management and Autonomous Spontaneus Security

Flexible and efficient distributed resolution of large entities
FoIKS'12 Proceedings of the 7th international conference on Foundations of Information and Knowledge Systems

Reference table based k-anonymous private blocking
Proceedings of the 27th Annual ACM Symposium on Applied Computing

Efficient and Practical Approach for Private Record Linkage
Journal of Data and Information Quality (JDIQ)

Multiple instance learning for group record linkage
PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I

CrowdER: crowdsourcing entity resolution
Proceedings of the VLDB Endowment

Matching product titles using web-based enrichment
Proceedings of the 21st ACM international conference on Information and knowledge management

De-duplication of aggregation authority files
International Journal of Metadata, Semantics and Ontologies

An automatic blocking mechanism for large-scale de-duplication tasks
Proceedings of the 21st ACM international conference on Information and knowledge management

Scalable and domain-independent entity coreference: establishing high quality data linkages across heterogeneous data sources
ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part II

NADEEF: a commodity data cleaning system
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

MFIBlocks: An effective blocking algorithm for entity resolution
Information Systems

A taxonomy of privacy-preserving record linkage techniques
Information Systems

Efficient XML duplicate detection using an adaptive two-level optimization
Proceedings of the 28th Annual ACM Symposium on Applied Computing

An efficient two-party protocol for approximate matching in private record linkage
AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121

Don't match twice: redundancy-free similarity computation with MapReduce
Proceedings of the Second Workshop on Data Analytics in the Cloud

SIGMa: simple greedy matching for aligning large knowledge bases
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

A distributed framework for scaling Up LSH-based computations in privacy preserving record linkage
Proceedings of the 6th Balkan Conference in Informatics

Active Sampling for Entity Matching with Guarantees
ACM Transactions on Knowledge Discovery from Data (TKDD) - Special Issue on ACM SIGKDD 2012

An automatic blocking strategy for XML duplicate detection
ACM SIGAPP Applied Computing Review

Efficient two-party private blocking based on sorted nearest neighborhood clustering
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

FusionDB: conflict management system for small-science databases
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

An iterative two-party protocol for scalable privacy-preserving record linkage
AusDM '12 Proceedings of the Tenth Australasian Data Mining Conference - Volume 134

Large-scale linked data integration using probabilistic reasoning and crowdsourcing
The VLDB Journal - The International Journal on Very Large Data Bases

Toward detection of aliases without string similarity
Information Sciences: an International Journal

Efficient indexing techniques for record matching and deduplication
International Journal of Computational Vision and Robotics

Abstract

Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
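The idea of reducing the number of compared record pairs can be illustrated with a small sketch. The abstract does not list the surveyed techniques, so the Python example below uses simple key-based blocking purely as an illustration. The field names (`surname`, `postcode`), the `blocking_key` definition, and the sample records are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the postcode."""
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records):
    """Return the pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up records for a deduplication run on a single database.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2600"},
    4: {"surname": "Jonas", "postcode": "2612"},
    5: {"surname": "smith", "postcode": "2600"},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(pairs))
print(f"{len(pairs)} of {all_pairs} pairs compared")
```

Without indexing, deduplicating n records requires n(n-1)/2 comparisons. With this key, only records in the same block are compared, so the choice of key trades the number of comparisons against the risk of placing true matches in different blocks and missing them.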