Adaptive duplicate detection using learnable string similarity measures
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
IEEE Transactions on Knowledge and Data Engineering
YALE: rapid prototyping for complex data mining tasks
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Example-driven design of efficient record matching queries
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Pairwise document similarity in large collections with MapReduce
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter
Frameworks for entity matching: A comparison
Data & Knowledge Engineering
Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MaPreduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
MapDupReducer: detecting near duplicates over massive datasets
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Evaluation of entity resolution approaches on real-world match problems
Proceedings of the VLDB Endowment
SystemML: Declarative machine learning on MapReduce
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Multi-pass sorted neighborhood blocking with MapReduce
Computer Science - Research and Development
Entity matching for semistructured data in the Cloud
Proceedings of the 27th Annual ACM Symposium on Applied Computing
MFIBlocks: An effective blocking algorithm for entity resolution
Information Systems
Hi-index | 0.00 |
Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.