A MapReduce-based filtering algorithm for vector similarity join

Authors:
Byoungju Yang;Jaeseok Myung;Sang-goo Lee;Dongjoo Lee
Affiliations:
Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Seoul National University, Seoul, Korea;Samsung Electronics Co., Ltd., Suwon, Korea
Venue:
Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
Year:
2013

Abstract

Vector Similarity Join is a fundamental operation in data cleaning and analysis. Because most objects can be represented as feature vectors, finding similar pairs of objects is an important task. However, Vector Similarity Join is computationally heavy: a naive join compares every pair of vectors, so its cost grows with the square of the number of vectors. Many filtering techniques have been proposed to reduce this load, and algorithms for distributed systems have been studied to handle large datasets. Even so, state-of-the-art approaches still perform a large amount of computation. In this paper, we propose a MapReduce algorithm that executes Vector Similarity Join efficiently. The first stage of our algorithm uses prefix filtering to reduce the number of candidate pairs. The second stage computes the similarities of the candidate pairs produced by the first stage. We present formulas that predict the number of candidates to demonstrate the effectiveness of our algorithm. Experimental results show that our algorithm outperforms state-of-the-art MapReduce algorithms.
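
The abstract names the two stages but gives no details, so the following is only a minimal single-process sketch of how a prefix-filtering stage followed by a verification stage might look. It is not the authors' algorithm. It assumes cosine similarity over sparse vectors stored as dictionaries, a global dimension order that puts rare dimensions first, and a Cauchy-Schwarz bound to choose each prefix; the map and reduce steps are simulated with plain loops. All function names (normalize, prefix_dims, stage1_candidates, stage2_verify, similarity_join) are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def normalize(vec):
    """Scale a sparse vector {dimension: weight} to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()}


def prefix_dims(vec, threshold, order):
    """Leading dimensions of vec (in the global order) whose removal leaves
    a suffix with Euclidean norm below the threshold.

    For unit vectors x and y, if y is zero on every prefix dimension of x,
    then dot(x, y) <= ||suffix(x)|| < threshold by Cauchy-Schwarz.
    """
    dims = sorted(vec, key=order.__getitem__)
    suffix_sq = 0.0
    cut = len(dims)
    # Grow the suffix from the end while its norm stays below the threshold.
    while cut > 0 and suffix_sq + vec[dims[cut - 1]] ** 2 < threshold ** 2:
        suffix_sq += vec[dims[cut - 1]] ** 2
        cut -= 1
    return dims[:cut]


def stage1_candidates(vectors, threshold):
    """Stage 1: emit candidate pairs that share at least one prefix dimension."""
    # Assumed global order: rarest dimensions first, ties broken by name.
    df = defaultdict(int)
    for vec in vectors.values():
        for d in vec:
            df[d] += 1
    order = {d: i for i, d in enumerate(sorted(df, key=lambda d: (df[d], d)))}

    # "Map": key each vector id by every dimension in its prefix.
    groups = defaultdict(list)
    for vid, vec in vectors.items():
        for d in prefix_dims(vec, threshold, order):
            groups[d].append(vid)

    # "Reduce": pair up the ids that arrive under the same dimension.
    candidates = set()
    for ids in groups.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def stage2_verify(vectors, candidates, threshold):
    """Stage 2: compute the exact similarity of each candidate pair."""
    result = {}
    for a, b in candidates:
        x, y = vectors[a], vectors[b]
        sim = sum(w * y[d] for d, w in x.items() if d in y)
        if sim >= threshold:
            result[(a, b)] = sim
    return result


def similarity_join(raw_vectors, threshold):
    """Run both stages on unit-normalized copies of the input vectors."""
    vectors = {vid: normalize(v) for vid, v in raw_vectors.items()}
    candidates = stage1_candidates(vectors, threshold)
    return stage2_verify(vectors, candidates, threshold)
```

Calling similarity_join on a dictionary of sparse vectors with a threshold between 0 and 1 returns the qualifying pairs with their cosine similarities. The benefit comes from stage 1: a naive join over n vectors verifies all n(n-1)/2 pairs, whereas here only pairs that collide on a prefix dimension reach stage 2. Ordering rare dimensions first is a common heuristic for keeping those collisions few; whether the paper uses the same order is not stated in the abstract.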