Robust record linkage blocking using suffix arrays

Authors:
Timothy de Vries;Hui Ke;Sanjay Chawla;Peter Christen
Affiliations:
University of Sydney, Sydney, NSW, Australia;University of Sydney, Sydney, NSW, Australia;University of Sydney, Sydney, NSW, Australia;Australian National University, Canberra, ACT, Australia
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

Record linkage is an important data integration task with many practical uses in matching, merging and duplicate removal across large and diverse databases. However, the quadratic cost of the brute-force approach, which compares every pair of records, necessitates appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, yielding much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and present results from experiments on real and synthetic data, which highlight the importance of efficient indexing and blocking in real-world applications where data sets contain millions of records.
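The blocking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the parameters `min_len`, `max_block` and `threshold`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. The sketch indexes every sufficiently long suffix of a record's blocking key, discards suffixes shared by too many records, and then slides a window of size two over the sorted suffixes to merge neighbouring blocks whose suffixes are similar.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def suffixes(key, min_len):
    """Return all suffixes of key that are at least min_len characters long."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


def build_index(records, min_len=4, max_block=10):
    """Map each suffix to the ids of records whose blocking key ends with it.

    min_len and max_block are illustrative parameters (assumed values).
    Suffixes shared by more than max_block records are dropped as too common.
    """
    index = defaultdict(set)
    for rec_id, key in records.items():
        for s in suffixes(key, min_len):
            index[s].add(rec_id)
    return {s: ids for s, ids in index.items() if len(ids) <= max_block}


def group_similar(index, threshold=0.8):
    """Merge blocks of neighbouring suffixes in sorted order when similar.

    A window of size two slides over the sorted suffixes. The similarity
    measure (SequenceMatcher ratio) and the threshold are assumptions.
    """
    ordered = sorted(index)
    if not ordered:
        return []
    blocks = []
    current = set(index[ordered[0]])
    for prev, nxt in zip(ordered, ordered[1:]):
        if SequenceMatcher(None, prev, nxt).ratio() >= threshold:
            current |= index[nxt]          # similar neighbour: extend the block
        else:
            blocks.append(current)         # dissimilar: close the block
            current = set(index[nxt])
    blocks.append(current)
    return blocks


def candidate_pairs(blocks):
    """Collect the unique record pairs that share at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


records = {1: "catherine", 2: "catherina", 3: "smith"}
index = build_index(records)
base_pairs = candidate_pairs(index.values())
grouped_pairs = candidate_pairs(group_similar(index))
```

With these three hypothetical keys, `base_pairs` is empty because "catherine" and "catherina" differ in their last character and so share no suffix, while `grouped_pairs` contains the pair `(1, 2)` because suffixes such as "atherina" and "atherine" are adjacent in sorted order and similar enough to be merged. Only the pairs produced by blocking would then be passed to a detailed record comparison, rather than all pairs.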