Record linkage is an important data integration task with many practical uses in matching, merging and duplicate removal across large and diverse databases. However, the brute-force approach of comparing all possible pairs of records scales quadratically, which necessitates appropriate indexing or blocking techniques. The aim of these techniques is to cheaply remove candidate record pairs that are unlikely to match. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments on real and synthetic data, which highlight the importance of efficient indexing and blocking in real-world applications where datasets contain millions of records. We extend our disk-based methods with the capability to use main-memory storage to construct Bloom filters, which we have found to yield a significant speedup by reducing the number of costly database queries by up to 70% on real data. We give practical implementation details and show how Bloom filters can easily be applied to suffix array based indexing.
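To make the idea of placing a Bloom filter in front of a suffix-based index more concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It builds an inverted index from key suffixes to record identifiers, then uses an in-memory Bloom filter to skip lookups for suffixes that are certainly absent. All names and parameters here (`suffixes`, `BloomFilter`, `build_index`, `candidates`, `min_len`, `max_block`, the filter size and the double-hashing scheme) are assumptions made for illustration, and a plain dictionary stands in for the disk-based database. The suffix grouping step is not shown.

```python
import hashlib
from collections import defaultdict


def suffixes(key, min_len=4):
    """Return all suffixes of key that have at least min_len characters."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


class BloomFilter:
    """Small Bloom filter; bit positions come from double hashing (assumed scheme)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_index(records, min_len=4, max_block=10):
    """Map each suffix to the set of record ids whose key contains it as a suffix."""
    index = defaultdict(set)
    for rec_id, key in records.items():
        for s in suffixes(key, min_len):
            index[s].add(rec_id)
    # Drop oversized blocks: very common suffixes discriminate poorly.
    index = {s: ids for s, ids in index.items() if len(ids) <= max_block}
    bloom = BloomFilter()
    for s in index:
        bloom.add(s)
    return index, bloom


def candidates(query_key, index, bloom, min_len=4):
    """Return candidate record ids and the number of index lookups performed."""
    found = set()
    lookups = 0
    for s in suffixes(query_key, min_len):
        if s not in bloom:  # certainly absent, so skip the costly lookup
            continue
        lookups += 1  # stands in for a database query
        found |= index.get(s, set())
    return found, lookups


# Hypothetical blocking keys, for illustration only.
records = {1: "catherine", 2: "katherine", 3: "smith"}
index, bloom = build_index(records)
found, lookups = candidates("katharine", index, bloom)
```

In this sketch the query key `katharine` produces six suffixes, of which only `rine` is present in the index, so the filter lets most lookups be skipped. A Bloom filter can report false positives but never false negatives, so the candidate set is unaffected; only the number of avoided lookups depends on the filter size and the number of hash functions.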