Learning-based entity resolution with MapReduce

Authors:
Lars Kolb;Hanna Köpcke;Andreas Thor;Erhard Rahm
Affiliations:
University of Leipzig, Leipzig, Germany;University of Leipzig, Leipzig, Germany;University of Leipzig, Leipzig, Germany;University of Leipzig, Leipzig, Germany
Venue:
Proceedings of the third international workshop on Cloud data management
Year:
2011

Citing 15
Cited 2

Adaptive duplicate detection using learnable string similarity measures

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

IEEE Transactions on Knowledge and Data Engineering
YALE: rapid prototyping for complex data mining tasks

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Parallel linkage

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Example-driven design of efficient record matching queries

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Pairwise document similarity in large collections with MapReduce

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
The WEKA data mining software: an update

ACM SIGKDD Explorations Newsletter
Frameworks for entity matching: A comparison

Data & Knowledge Engineering
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
MapDupReducer: detecting near duplicates over massive datasets

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Evaluation of entity resolution approaches on real-world match problems

Proceedings of the VLDB Endowment
SystemML: Declarative machine learning on MapReduce

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Multi-pass sorted neighborhood blocking with MapReduce

Computer Science - Research and Development

Entity matching for semistructured data in the Cloud

Proceedings of the 27th Annual ACM Symposium on Applied Computing
MFIBlocks: An effective blocking algorithm for entity resolution

Information Systems

Abstract

Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
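
To make the idea of pair-wise similarity computation and classifier application on the Cartesian product more concrete, the following is a minimal sketch in plain Python that simulates a map step and a reduce step locally. It is not the paper's implementation and does not use Hadoop. The partition-pair scheme, the function names (`chunks`, `map_phase`, `reduce_phase`, `resolve`), the string similarity based on `difflib`, and the fixed threshold standing in for a trained classifier are all assumptions made for illustration; the abstract does not describe the two proposed strategies in detail.

```python
from difflib import SequenceMatcher
from itertools import product


def chunks(records, n):
    """Split a list into at most n partitions of roughly equal size."""
    size = max(1, -(-len(records) // n))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_phase(r_parts, s_parts):
    """Emit one task per pair of partitions so that together the tasks
    cover the full Cartesian product R x S exactly once."""
    for (i, rp), (j, sp) in product(enumerate(r_parts), enumerate(s_parts)):
        yield (i, j), (rp, sp)


def similarity(a, b):
    """Hypothetical similarity feature: case-insensitive string ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(features, threshold=0.8):
    """Placeholder for a trained classifier (assumption): a simple
    threshold on the first similarity feature."""
    return features[0] >= threshold


def reduce_phase(task):
    """Compare all record pairs of one partition pair and keep matches."""
    _, (rp, sp) = task
    return [(r, s) for r, s in product(rp, sp) if classify([similarity(r, s)])]


def resolve(source_r, source_s, n_parts=2):
    """Run all tasks sequentially; a cluster would run them in parallel."""
    tasks = map_phase(chunks(source_r, n_parts), chunks(source_s, n_parts))
    return [match for task in tasks for match in reduce_phase(task)]


if __name__ == "__main__":
    R = ["Canon EOS 5D", "Nikon D90"]
    S = ["canon eos 5d", "Sony Alpha 100"]
    print(resolve(R, S))
```

Because each task only needs its own pair of partitions, the tasks are independent of one another, which is what would allow them to be distributed across the nodes of a cluster. In a learning-based setting, `classify` would be replaced by a model trained on labeled record pairs and would typically receive several similarity features rather than one.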