The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding Interesting Associations without Support Pruning
IEEE Transactions on Knowledge and Data Engineering
Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Text joins in an RDBMS for web data integration
WWW '03 Proceedings of the 12th international conference on World Wide Web
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Identity uncertainty
Comparative study of name disambiguation problem using a scalable blocking-based framework
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Efficient algorithms for substring near neighbor problem
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
A Primitive Operator for Similarity Joins in Data Cleaning
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions
FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
Collective entity resolution in relational data
ACM Transactions on Knowledge Discovery from Data (TKDD)
Scaling up all pairs similarity search
Proceedings of the 16th international conference on World Wide Web
The Google Similarity Distance
IEEE Transactions on Knowledge and Data Engineering
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Multi-probe LSH: efficient indexing for high-dimensional similarity search
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Swoosh: a generic approach to entity resolution
The VLDB Journal — The International Journal on Very Large Data Bases
Nearest Neighbor Retrieval Using Distance-Based Hashing
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Learning blocking schemes for record linkage
AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Constraint-based entity matching
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
Eliminating the redundancy in blocking-based entity resolution methods
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
To compare or not to compare: making entity resolution more efficient
Proceedings of the International Workshop on Semantic Web Information Management
Efficient duplicate detection on cloud using a new signature scheme
WAIM'11 Proceedings of the 12th international conference on Web-age information management
Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data
Proceedings of the fifth ACM international conference on Web search and data mining
Leveraging unlabeled data to scale blocking for record linkage
IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Three
Efficient privacy-aware record integration
Proceedings of the 16th International Conference on Extending Database Technology
A distributed framework for scaling Up LSH-based computations in privacy preserving record linkage
Proceedings of the 6th Balkan Conference in Informatics
Hi-index | 0.00 |
We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose the Iterative Locality-Sensitive Hashing (ILSH) that dynamically merges LSH-based has tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named as HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).