Efficient processing of k nearest neighbor joins using MapReduce

Authors:
Wei Lu;Yanyan Shen;Su Chen;Beng Chin Ooi
Affiliations:
National University of Singapore;National University of Singapore;National University of Singapore;National University of Singapore
Venue:
Proceedings of the VLDB Endowment
Year:
2012



Abstract

The k nearest neighbor join (kNN join), designed to find the k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, the kNN join is expensive. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a single centralized machine. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups, and the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering and hence reduces both the shuffling and computational costs. To reduce the shuffling cost further, we propose two approximate algorithms that minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
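
The idea of grouping objects in the mappers and joining each group in the reducers can be illustrated with a small single-process sketch. This is not the paper's algorithm: the function names, the choice of pivots, and the pruning bound are assumptions made for illustration. Objects of R are grouped by their nearest pivot, and an object of S is replicated to a group only if a simple triangle-inequality bound cannot rule it out. The sketch assumes Euclidean distance and that S holds at least k objects.

```python
import heapq
import math
from collections import defaultdict


def knn_join_brute(R, S, k):
    """Reference kNN join: for each r in R, the k closest (distance, s_index) pairs."""
    return {
        i: heapq.nsmallest(k, ((math.dist(r, s), j) for j, s in enumerate(S)))
        for i, r in enumerate(R)
    }


def knn_join_partitioned(R, S, k, pivots):
    """Hypothetical MapReduce-style kNN join simulated in one process."""
    # "Map" step for R: assign each object to the group of its nearest pivot
    # and track the group radius (largest distance from a member to the pivot).
    r_groups = defaultdict(list)
    radius = defaultdict(float)
    for i, r in enumerate(R):
        g = min(range(len(pivots)), key=lambda p: math.dist(r, pivots[p]))
        r_groups[g].append(i)
        radius[g] = max(radius[g], math.dist(r, pivots[g]))

    # "Map" step for S: replicate s to group g only if it cannot be pruned.
    # Assumed bound: every r in group g has k objects of S within
    # theta = radius[g] + (k-th smallest distance from the pivot to S),
    # so any s with dist(s, pivot) > radius[g] + theta is farther than theta
    # from every r in the group and cannot be one of its k nearest neighbors.
    s_groups = defaultdict(list)
    for g in r_groups:
        d_to_pivot = [math.dist(s, pivots[g]) for s in S]
        kth = heapq.nsmallest(k, d_to_pivot)[-1]
        theta = radius[g] + kth
        for j, d in enumerate(d_to_pivot):
            if d <= radius[g] + theta:
                s_groups[g].append(j)

    # "Reduce" step: each group joins its R objects against its S candidates only.
    result = {}
    for g, r_ids in r_groups.items():
        for i in r_ids:
            result[i] = heapq.nsmallest(
                k, ((math.dist(R[i], S[j]), j) for j in s_groups[g])
            )
    return result
```

In this sketch the number of entries in each group's candidate list plays the role of the shuffling cost: a tighter bound means fewer replicas of S objects sent to the reducers, which is the quantity the abstract says the pruning rules and approximate algorithms aim to reduce.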