Record linkage performance for large data sets

Authors:
Jordi Gómez-Bao;Josep-L. Larriba-Pey;Josepa Ribes Puig
Affiliations:
Universitat Politècnica de Catalunya, Barcelona, Spain;Universitat Politècnica de Catalunya, Barcelona, Spain;Pla Director d'Oncologia de Catalunya, l'Hospitalet de Llobregat, Spain
Venue:
Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases
Year:
2009

Citing 9
Cited 2

The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

A fast filtering scheme for large database cleansing
Proceedings of the eleventh international conference on Information and knowledge management

Learning to match and cluster large high-dimensional data sets for data integration
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Text joins in an RDBMS for web data integration
WWW '03 Proceedings of the 12th international conference on World Wide Web

Efficient Record Linkage in Large Data Sets
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications

Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Privacy-preserving data integration and sharing
Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

Similarity-aware indexing for real-time entity resolution
Proceedings of the 18th ACM conference on Information and knowledge management

Development and user experiences of an open source data cleaning, deduplication and record linkage system
ACM SIGKDD Explorations Newsletter

De-duplication of aggregation authority files
International Journal of Metadata, Semantics and Ontologies

Abstract

We propose new data structures that speed up Record Linkage by exploiting the value distribution of common string attributes, such as name or surname. At the cost of some additional memory, they increase processing speed by almost an order of magnitude with no loss of recall or precision. The improvement is independent of both the method used to reduce the number of record comparisons, such as Blocking or Sliding Window, and the specific string comparison functions.
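
The abstract does not describe the data structures themselves. As a hedged illustration of the general idea only, and not the authors' design, the sketch below assumes that attributes such as surnames contain many repeated values, so the result of comparing two distinct values can be stored once and reused. The names `similarity`, `CachedComparator`, and `link`, the bigram-based comparison function, and the sample data are all hypothetical.

```python
from itertools import combinations


def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Stand-in comparison (Dice coefficient on bigrams); any symmetric
    string comparison function could be substituted here."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


class CachedComparator:
    """Trades memory for speed: each unordered pair of distinct values
    is compared once, and later occurrences reuse the stored score."""

    def __init__(self, compare):
        self.compare = compare
        self.cache = {}
        self.calls = 0  # number of real comparison calls performed

    def __call__(self, a, b):
        key = (a, b) if a <= b else (b, a)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.compare(*key)
        return self.cache[key]


def link(records, compare, threshold=0.5):
    """Compare every pair of records (no blocking, for brevity) and
    return the index pairs whose score reaches the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if compare(a, b) >= threshold
    ]


# Hypothetical sample: a skewed attribute with repeated values.
surnames = ["garcia", "garcia", "garsia", "martinez", "garcia", "martinez"]

cached = CachedComparator(similarity)
matches_cached = link(surnames, cached)
matches_plain = link(surnames, similarity)
```

Because the cache only wraps the comparison function, the set of matches is unchanged, and the same wrapper could sit behind a Blocking or Sliding Window pass with any comparison function. In this toy input, the 15 record pairs involve only a handful of distinct value pairs, so most comparisons become dictionary lookups.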