Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Record linkage: making maximum use of the discriminating power of identifying information
Communications of the ACM
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery
Efficient Record Linkage in Large Data Sets
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
TAILOR: A Record Linkage Tool Box
ICDE '02 Proceedings of the 18th International Conference on Data Engineering
A Fast Linkage Detection Scheme for Multi-Source Information Integration
WIRI '05 Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration
Adaptive sorted neighborhood methods for efficient record linkage
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Achieving both high precision and high recall in near-duplicate detection
Proceedings of the 17th ACM conference on Information and knowledge management
Large-Scale Deduplication with Constraints Using Dedupalog
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Robust Record Linkage Blocking Using Suffix Arrays and Bloom Filters
ACM Transactions on Knowledge Discovery from Data (TKDD)
Efficient entity resolution for large heterogeneous information spaces
Proceedings of the fourth ACM international conference on Web search and data mining
A fast approach for parallel deduplication on multicore processors
Proceedings of the 2011 ACM Symposium on Applied Computing
Interaction between record matching and data repairing
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Eliminating the redundancy in blocking-based entity resolution methods
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Detecting and exploiting stability in evolving heterogeneous information spaces
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
To compare or not to compare: making entity resolution more efficient
Proceedings of the International Workshop on Semantic Web Information Management
Web trace duplication detection based on context
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data
Proceedings of the fifth ACM international conference on Web search and data mining
Efficient indexing techniques for record matching and deduplication
International Journal of Computational Vision and Robotics
Hi-index | 0.00 |
Record linkage is an important data integration task that has many practical uses for matching, merging and duplicate removal in large and diverse databases. However, a quadratic scalability for the brute force approach necessitates the design of appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in a much higher accuracy while retaining the high scalability of the base suffix array method. Efficiently grouping similar suffixes is carried out with the use of a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlights the importance of using efficient indexing and blocking in real world applications where data sets contain millions of records.