Multi-pass sorted neighborhood blocking with MapReduce

Authors:
Lars Kolb;Andreas Thor;Erhard Rahm
Affiliations:
Institut für Informatik, Fakultät für Mathematik und Informatik, Universität Leipzig, Leipzig, Germany 04009;Institut für Informatik, Fakultät für Mathematik und Informatik, Universität Leipzig, Leipzig, Germany 04009;Institut für Informatik, Fakultät für Mathematik und Informatik, Universität Leipzig, Leipzig, Germany 04009
Venue:
Computer Science - Research and Development
Year:
2012

Abstract

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate the challenges of, and possible solutions for, using the MapReduce programming model for parallel entity resolution with Sorted Neighborhood (SN) blocking. We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation on real-world datasets shows that the proposed approaches are both efficient and effective.
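
For readers unfamiliar with the blocking technique named in the abstract, the following is a minimal sequential sketch of Sorted Neighborhood blocking in its commonly described form: records are sorted by a blocking key and only records that fall within a sliding window of the sorted order become candidate pairs. It is not the authors' MapReduce implementation and does not cover their data replication or load balancing. The function names, the sample records, the blocking keys, and the window size are all assumptions chosen for illustration.

```python
def sorted_neighborhood_pairs(records, blocking_key, window):
    """Single-pass SN: candidate pairs within a sliding window over the key order.

    records      -- dict mapping a record id to a record (hypothetical layout)
    blocking_key -- function deriving a sortable key from a record (assumption)
    window       -- number of consecutive records covered by the window (>= 2)
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    # Sort ids by blocking key; the id breaks ties so the order is deterministic.
    ordered = sorted(records, key=lambda rid: (blocking_key(records[rid]), rid))
    pairs = set()
    for i, left in enumerate(ordered):
        # Pair each record with the next (window - 1) records in sort order.
        for right in ordered[i + 1 : i + window]:
            pairs.add((min(left, right), max(left, right)))
    return pairs


def multi_pass_pairs(records, blocking_keys, window):
    """Multi-pass SN: union of the candidate pairs from one pass per blocking key."""
    pairs = set()
    for blocking_key in blocking_keys:
        pairs |= sorted_neighborhood_pairs(records, blocking_key, window)
    return pairs


# Hypothetical sample data, not taken from the paper's evaluation.
records = {
    1: {"name": "Smith", "city": "Leipzig"},
    2: {"name": "Smyth", "city": "Berlin"},
    3: {"name": "Adams", "city": "Leipzig"},
    4: {"name": "Miller", "city": "Berlin"},
}


def by_name(record):
    return record["name"][:3].lower()


def by_city(record):
    return record["city"].lower()


single_pass = sorted_neighborhood_pairs(records, by_name, window=2)
multi_pass = multi_pass_pairs(records, [by_name, by_city], window=2)
print(sorted(single_pass))
print(sorted(multi_pass))
```

In this sketch the second pass over a different key adds candidate pairs that the first pass misses, which is the usual motivation for multi-pass SN; the shared set removes pairs that both passes produce.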