Don't match twice: redundancy-free similarity computation with MapReduce

Authors:
Lars Kolb;Andreas Thor;Erhard Rahm
Affiliations:
University of Leipzig;University of Leipzig;University of Leipzig
Venue:
Proceedings of the Second Workshop on Data Analytics in the Cloud
Year:
2013

Citing: 14 references
Cited by: 0

Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web

CloudBurst
Bioinformatics

Pairwise document similarity in large collections with MapReduce
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers

Frameworks for entity matching: A comparison
Data & Knowledge Engineering

All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids
IEEE Transactions on Parallel and Distributed Systems

Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Document Similarity Self-Join with MapReduce
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining

Efficient entity resolution for large heterogeneous information spaces
Proceedings of the fourth ACM international conference on Web search and data mining

Cloud Technologies for Bioinformatics Applications
IEEE Transactions on Parallel and Distributed Systems

Eliminating the redundancy in blocking-based entity resolution methods
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

Multi-pass sorted neighborhood blocking with MapReduce
Computer Science - Research and Development

Load Balancing for MapReduce-based Entity Resolution
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication
IEEE Transactions on Knowledge and Data Engineering

Dedoop: efficient deduplication with Hadoop
Proceedings of the VLDB Endowment

Abstract

To improve the effectiveness of pair-wise similarity computation, state-of-the-art approaches assign objects to multiple overlapping clusters. This introduces redundant pair comparisons when similar objects share more than one cluster. We propose an approach that eliminates such redundant comparisons and that can be easily integrated into existing MapReduce implementations. We evaluate the approach on a real cloud infrastructure and show its effectiveness for all degrees of redundancy.
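
The redundancy described in the abstract can be illustrated with a small single-process simulation. The sketch below is not taken from the paper: it assumes one plausible elimination rule, namely that every object carries the full set of its cluster keys and a pair is compared only in the smallest cluster key the two objects share. The function name `candidate_pairs` and the input layout (object id mapped to a list of cluster keys) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(assignments, deduplicate=True):
    """Simulate map (group by cluster key) and reduce (pair generation).

    assignments: dict mapping object id -> iterable of cluster keys.
    Assumed rule (not confirmed by the abstract): a pair is emitted only
    in the smallest cluster key that both objects share.
    """
    # "Map" step: replicate each object to every cluster it belongs to,
    # annotated with its complete key set.
    groups = defaultdict(list)
    for obj_id, keys in assignments.items():
        key_set = frozenset(keys)
        for key in key_set:
            groups[key].append((obj_id, key_set))

    # "Reduce" step: enumerate pairs within each cluster.
    pairs = []
    for key in sorted(groups):
        members = sorted(groups[key], key=lambda m: m[0])
        for (a, keys_a), (b, keys_b) in combinations(members, 2):
            # Skip the pair if a smaller shared cluster already handles it.
            if deduplicate and min(keys_a & keys_b) != key:
                continue
            pairs.append((a, b))
    return pairs


# Objects "a" and "b" share two clusters, so the naive scheme compares them twice.
example = {"a": ["x", "y"], "b": ["x", "y"], "c": ["y"]}
naive = candidate_pairs(example, deduplicate=False)
unique = candidate_pairs(example)
```

In this example the naive scheme generates four comparisons, while the rule above generates three, one per distinct pair. Because the check uses only the key sets attached to the two objects, it needs no communication between reduce tasks, which is one reason such a rule would be easy to add to an existing MapReduce job.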