Social content matching in MapReduce
Proceedings of the VLDB Endowment
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors
Proceedings of the VLDB Endowment
MapReduce algorithms for big data analysis
Proceedings of the VLDB Endowment
Optimizing parallel algorithms for all pairs similarity search
Proceedings of the sixth ACM international conference on Web search and data mining
Don't match twice: redundancy-free similarity computation with MapReduce
Proceedings of the Second Workshop on Data Analytics in the Cloud
Scalable all-pairs similarity search in metric spaces
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Penguins in sweaters, or serendipitous entity search on user-generated content
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Dimension independent similarity computation
The Journal of Machine Learning Research
Hi-index | 0.00 |
iven a collection of objects, the Similarity Self-Join problem requires to discover all those pairs of objects whose similarity is above a user defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real world data shows that our algorithm outperforms the state of the art by a factor 4.5.