HARRA: fast iterative hashed record linkage for large-scale data collections

Authors:
Hung-sik Kim;Dongwon Lee
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 13th International Conference on Extending Database Technology
Year:
2010

Citing 27
Cited 7

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficient clustering of high-dimensional data sets with application to reference matching

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding Interesting Associations without Support Pruning

IEEE Transactions on Knowledge and Data Engineering
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Text joins in an RDBMS for web data integration

WWW '03 Proceedings of the 12th international conference on World Wide Web
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Identity uncertainty

Identity uncertainty
Comparative study of name disambiguation problem using a scalable blocking-based framework

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Reference reconciliation in complex information spaces

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Efficient algorithms for substring near neighbor problem

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques

World Wide Web
Collective entity resolution in relational data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
The Google Similarity Distance

IEEE Transactions on Knowledge and Data Engineering
Parallel linkage

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Multi-probe LSH: efficient indexing for high-dimensional similarity search

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Swoosh: a generic approach to entity resolution

The VLDB Journal — The International Journal on Very Large Data Bases
Nearest Neighbor Retrieval Using Distance-Based Hashing

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Learning blocking schemes for record linkage

AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
Constraint-based entity matching

AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2

Eliminating the redundancy in blocking-based entity resolution methods

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
To compare or not to compare: making entity resolution more efficient

Proceedings of the International Workshop on Semantic Web Information Management
Efficient duplicate detection on cloud using a new signature scheme

WAIM'11 Proceedings of the 12th international conference on Web-age information management
Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data

Proceedings of the fifth ACM international conference on Web search and data mining
Leveraging unlabeled data to scale blocking for record linkage

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Three
Efficient privacy-aware record integration

Proceedings of the 16th International Conference on Extending Database Technology
A distributed framework for scaling Up LSH-based computations in privacy preserving record linkage

Proceedings of the 6th Balkan Conference in Informatics

Quantified Score

Hi-index	0.00

Visualization

Abstract

We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose the Iterative Locality-Sensitive Hashing (ILSH) that dynamically merges LSH-based has tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named as HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).