Efficient Record Linkage in Large Data Sets

Authors:
Liang Jin;Chen Li;Sharad Mehrotra
Venue:
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
Year:
2003

Citing 0
Cited 40

Detecting duplicate objects in XML documents

Proceedings of the 2004 international workshop on Information quality in information systems
Web data integration using approximate string join

Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Methods for evaluating and creating data quality

Information Systems - Special issue: Data quality in cooperative information systems
Reference reconciliation in complex information spaces

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
DogmatiX tracks down duplicates in XML

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Selectivity estimation for fuzzy string predicates in large data sets

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Indexing mixed types for approximate retrieval

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Automatically utilizing secondary sources to align information across sources

AI Magazine - Special issue on semantic integration
Domain-independent data cleaning via analysis of entity-relationship graph

ACM Transactions on Database Systems (TODS)
Estimating the selectivity of approximate string queries

ACM Transactions on Database Systems (TODS)
Privacy preserving schema and data matching

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Structure-based inference of XML similarity for fuzzy duplicate detection

Proceedings of the sixteenth ACM conference on information and knowledge management
Are your citations clean?

Communications of the ACM
Febrl: a freely available record linkage system with a graphical user interface

HDKM '08 Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80
SEPIA: estimating selectivities of approximate string predicates in large Databases

The VLDB Journal — The International Journal on Very Large Data Bases
Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Swoosh: a generic approach to entity resolution

The VLDB Journal — The International Journal on Very Large Data Bases
Disambiguating authors in academic publications using random forests

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Robust record linkage blocking using suffix arrays

Proceedings of the 18th ACM conference on Information and knowledge management
Record linkage performance for large data sets

Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases
Development and user experiences of an open source data cleaning, deduplication and record linkage system

ACM SIGKDD Explorations Newsletter
Scaling record linkage to non-uniform distributed class sizes

PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Generalizing prefix filtering to improve set similarity joins

Information Systems
A multilevel and domain-independent duplicate detection model for scientific database

WAIM'10 Proceedings of the 11th international conference on Web-age information management
Evaluating entity resolution results

Proceedings of the VLDB Endowment
Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

ACM Transactions on Knowledge Discovery from Data (TKDD)
Efficient entity resolution for large heterogeneous information spaces

Proceedings of the fourth ACM international conference on Web search and data mining
Detecting and exploiting stability in evolving heterogeneous information spaces

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Integrating large and distributed life sciences resources for systems biology research: progress and new challenges

Transactions on large-scale data- and knowledge-centered systems III
PG-join: proximity graph based string similarity joins

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management
Effective early termination techniques for text similarity join operator

ISCIS'05 Proceedings of the 20th international conference on Computer and Information Sciences
Decision models for record linkage

Data Mining
Multiple valued logic approach for matching patient records in multiple databases

Journal of Biomedical Informatics
Efficient and Practical Approach for Private Record Linkage

Journal of Data and Information Quality (JDIQ)
An automatic blocking mechanism for large-scale de-duplication tasks

Proceedings of the 21st ACM international conference on Information and knowledge management
Adaptive Connection Strength Models for Relationship-Based Entity Resolution

Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution
Towards scalable real-time entity resolution using a similarity-aware inverted index approach

AusDM '08 Proceedings of the 7th Australasian Data Mining Conference - Volume 87
MFIBlocks: An effective blocking algorithm for entity resolution

Information Systems
A taxonomy of privacy-preserving record linkage techniques

Information Systems
A distributed framework for scaling up LSH-based computations in privacy preserving record linkage

Proceedings of the 6th Balkan Conference in Informatics

Quantified Score

Hi-index	0.02

Abstract

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.
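
The two steps named in the abstract (map attribute values into a Euclidean space, then join in that space) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single string attribute, edit distance as the domain-specific similarity, a textbook FastMap with the usual farthest-pair pivot heuristic, and a naive nested-loop join standing in for a real multidimensional similarity join. The names `edit_distance`, `fastmap`, `linkage_join`, and the parameters `delta`, `embed_threshold`, and `k` are hypothetical and chosen only for this example.

```python
import random
from itertools import product


def edit_distance(a, b):
    """Levenshtein distance, used here as the assumed attribute similarity."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fastmap(objects, dist, k, seed=0):
    """Embed objects into a k-dimensional space; returns one coordinate list each."""
    rng = random.Random(seed)
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, axis):
        # Squared distance left over after removing the axes already assigned.
        r = dist(objects[i], objects[j]) ** 2
        for h in range(axis):
            r -= (coords[i][h] - coords[j][h]) ** 2
        return max(r, 0.0)  # clamp: edit distance is not Euclidean

    for axis in range(k):
        # Pivot heuristic: start anywhere, hop twice to a far-apart pair.
        a = rng.randrange(n)
        b = max(range(n), key=lambda j: d2(a, j, axis))
        a = max(range(n), key=lambda j: d2(b, j, axis))
        dab2 = d2(a, b, axis)
        if dab2 == 0:
            break  # no residual distance left to explain
        dab = dab2 ** 0.5
        for i in range(n):
            # Projection of object i onto the line through the two pivots.
            coords[i][axis] = (d2(a, i, axis) + dab2 - d2(b, i, axis)) / (2 * dab)
    return coords


def linkage_join(left, right, delta, embed_threshold, k=4):
    """Return pairs (x, y) with edit_distance(x, y) <= delta.

    Pairs are first filtered by Euclidean distance in the embedded space,
    then verified with the true distance. The nested loop is a stand-in
    for an index-based multidimensional similarity join.
    """
    coords = fastmap(list(left) + list(right), edit_distance, k)
    lc, rc = coords[:len(left)], coords[len(left):]
    matches = []
    for (i, x), (j, y) in product(enumerate(left), enumerate(right)):
        gap2 = sum((p - q) ** 2 for p, q in zip(lc[i], rc[j]))
        if gap2 <= embed_threshold ** 2 and edit_distance(x, y) <= delta:
            matches.append((x, y))
    return matches


if __name__ == "__main__":
    names_a = ["smith", "johnson", "mehrotra"]
    names_b = ["smyth", "jonson", "li", "smith"]
    print(linkage_join(names_a, names_b, delta=1, embed_threshold=2.0))
```

Because the embedding only approximates the original distance, a tight `embed_threshold` can drop true matches, while a loose one admits more candidates for verification. In this sketch the threshold is a tunable assumption; how it should be chosen for a given data set is not something the abstract specifies.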