To improve the effectiveness of pairwise similarity computation, state-of-the-art approaches assign objects to multiple overlapping clusters. This introduces redundant pair comparisons whenever similar objects share more than one cluster. We propose an approach that eliminates such redundant comparisons and can be easily integrated into existing MapReduce implementations. We evaluate the approach on a real cloud infrastructure and show its effectiveness for all degrees of redundancy.
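
The redundancy described above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes one common way to avoid repeated comparisons, namely that each object carries its full set of cluster keys and a pair is compared only in the smallest cluster key the two objects share. The function name `redundancy_free_pairs`, the `assignments` layout, and the in-memory map and reduce phases are hypothetical and stand in for a real MapReduce job.

```python
from itertools import combinations


def redundancy_free_pairs(assignments):
    """Yield (cluster_key, a, b) so that each co-clustered pair appears once.

    assignments: dict mapping an object id to the set of cluster keys
    it was assigned to (assumed layout, for illustration only).
    """
    # "Map" phase: emit each object once per cluster it belongs to.
    clusters = {}
    for obj, keys in assignments.items():
        for key in keys:
            clusters.setdefault(key, []).append(obj)

    # "Reduce" phase: within a cluster, compare a pair only if this
    # cluster is the smallest key the two objects have in common.
    for key in sorted(clusters):
        for a, b in combinations(sorted(clusters[key]), 2):
            common = assignments[a] & assignments[b]
            if min(common) == key:
                yield key, a, b


# Four objects in three overlapping clusters (made-up example data).
assignments = {
    "a": {1, 2},
    "b": {1, 2, 3},
    "c": {2, 3},
    "d": {3},
}
pairs = list(redundancy_free_pairs(assignments))

# Number of comparisons a naive per-cluster scheme would perform.
members = {}
for obj, keys in assignments.items():
    for key in keys:
        members.setdefault(key, []).append(obj)
naive = sum(len(m) * (len(m) - 1) // 2 for m in members.values())
```

In this toy input the naive scheme performs 7 comparisons, while the filtered scheme performs 5, one per distinct pair. The check needs only the two objects' cluster sets, so it can run locally inside each reduce task without coordination between tasks.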