Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters

Authors:
Timothy de Vries; Hui Ke; Sanjay Chawla; Peter Christen
Affiliations:
University of Sydney; University of Sydney; University of Sydney; The Australian National University
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2011

Abstract

Record linkage is an important data integration task with many practical uses in matching, merging and duplicate removal across large and diverse databases. However, the brute-force approach of comparing all possible pairs of records scales quadratically, which necessitates appropriate indexing or blocking techniques. These techniques aim to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, yielding much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments on real and synthetic data, which highlight the importance of efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to utilise main-memory storage to construct Bloom filters, which we found to give a significant speedup by reducing the number of costly database queries by up to 70% on real data. We give practical implementation details and show how Bloom filters can be easily applied to suffix-array-based indexing.
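
The combination of suffix-based blocking with a Bloom filter described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `BloomFilter`, `suffixes`, `build_index` and `candidates`, the parameters `min_len` and `max_block`, and the use of an in-memory dictionary in place of a disk-based database are all assumptions made for illustration. The suffix grouping (sliding window) step is omitted. The sketch only shows the general idea that a Bloom filter held in main memory can rule out suffixes that are certainly absent from the index, so that the costlier lookup is skipped for them.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Fixed-size Bloom filter; positions derived by double hashing (assumed design)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def suffixes(value, min_len):
    """Return every suffix of value that has at least min_len characters."""
    return [value[i:] for i in range(len(value) - min_len + 1)]


def build_index(records, min_len=3, max_block=10):
    """Map each suffix of a record's blocking key to the IDs of records sharing it."""
    index = defaultdict(set)
    for rec_id, key in records.items():
        for s in suffixes(key, min_len):
            index[s].add(rec_id)
    # Discard suffixes shared by too many records; they make poor blocks.
    index = {s: ids for s, ids in index.items() if len(ids) <= max_block}
    bloom = BloomFilter()
    for s in index:
        bloom.add(s)
    return index, bloom


def candidates(query_key, index, bloom, min_len=3):
    """Return candidate record IDs and the number of index lookups performed."""
    found, lookups = set(), 0
    for s in suffixes(query_key, min_len):
        if s not in bloom:
            continue  # certainly not indexed, so the lookup is skipped
        lookups += 1  # stands in for a database query in this sketch
        found |= index.get(s, set())
    return found, lookups


records = {1: "catherine", 2: "katherine", 3: "smith"}
index, bloom = build_index(records)
print(candidates("catherine", index, bloom))
```

In this toy data, "catherine" and "katherine" share the suffix "atherine" and all shorter suffixes, so both records land in the same candidate set even though their first characters differ. A Bloom filter never reports a stored key as absent, so skipping lookups on a negative answer cannot lose candidates; a false positive only costs one unnecessary lookup.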