SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Interactive deduplication using active learning
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Text joins in an RDBMS for web data integration
WWW '03 Proceedings of the 12th international conference on World Wide Web
Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Comparative study of name disambiguation problem using a scalable blocking-based framework
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Adaptive Name Matching in Information Integration
IEEE Intelligent Systems
The Google Similarity Distance
IEEE Transactions on Knowledge and Data Engineering
Constraint-based entity matching
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2
Disambiguating authors in academic publications using random forests
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Generating data quality rules and integration into ETL process
Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP
HARRA: fast iterative hashed record linkage for large-scale data collections
Proceedings of the 13th International Conference on Extending Database Technology
A fast approach for parallel deduplication on multicore processors
Proceedings of the 2011 ACM Symposium on Applied Computing
Learning-based entity resolution with MapReduce
Proceedings of the third international workshop on Cloud data management
Multi-pass sorted neighborhood blocking with MapReduce
Computer Science - Research and Development
Flexible and efficient distributed resolution of large entities
FoIKS'12 Proceedings of the 7th international conference on Foundations of Information and Knowledge Systems
An automatic blocking mechanism for large-scale de-duplication tasks
Proceedings of the 21st ACM international conference on Information and knowledge management
Data & Knowledge Engineering
Hi-index | 0.00 |
We study the parallelization of the (record) linkage problem - i.e., to identify matching records between two collections of records, A and B. One of main idiosyncrasies of the linkage problem, compared to Database join, is the fact that once two records a in A and b in B are matched and merged to c, c needs to be compared to the rest of records in A and B again since it may incur new matching. This re-feeding stage of the linkage problem requires its solution to be iterative, and complicates the problem significantly. Toward this problem, we first discuss three plausible scenarios of inputs - when both collections are clean, only one is clean, and both are dirty. Then, we show that the intricate interplay between match and merge can exploit the characteristics of each scenario to achieve good parallelization. Our parallel algorithms achieve 6.55-7.49 times faster in speedup compared to sequential ones with 8 processors, and 11.15-18.56% improvement in efficiency compared to P-Swoosh.