Document Similarity Self-Join with MapReduce

Authors:
Ranieri Baraglia;Gianmarco De Francisci Morales;Claudio Lucchese
Affiliations:
-;-;-
Venue:
ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining
Year:
2010

Citing 0
Cited 8

Social content matching in MapReduce

Proceedings of the VLDB Endowment
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
MapReduce algorithms for big data analysis

Proceedings of the VLDB Endowment
Optimizing parallel algorithms for all pairs similarity search

Proceedings of the sixth ACM international conference on Web search and data mining
Don't match twice: redundancy-free similarity computation with MapReduce

Proceedings of the Second Workshop on Data Analytics in the Cloud
Scalable all-pairs similarity search in metric spaces

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Penguins in sweaters, or serendipitous entity search on user-generated content

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Dimension independent similarity computation

The Journal of Machine Learning Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

iven a collection of objects, the Similarity Self-Join problem requires to discover all those pairs of objects whose similarity is above a user defined threshold. In this paper we focus on document collections, which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. This work borrows from the state of the art in serial algorithms for similarity join and MapReduce-based techniques for set-similarity join. The proposed algorithm shows that it is possible to leverage a distributed file system to support communication patterns that do not naturally fit the MapReduce framework. Scalability is achieved by introducing a partitioning strategy able to overcome memory bottlenecks. Experimental evidence on real world data shows that our algorithm outperforms the state of the art by a factor 4.5.