The k nearest neighbor join (kNN join), which finds the k nearest neighbors in a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. Because it combines the k nearest neighbor query with the join operation, the kNN join is expensive. Given the increasing volume of data, it is difficult to perform a kNN join efficiently on a single centralized machine. In this paper, we investigate how to perform the kNN join using MapReduce, a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups, and the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To further reduce the shuffling cost, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
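
The map/reduce split described in the abstract can be illustrated with a small single-process sketch. It is not the paper's algorithm: the function names, the pivot-based grouping, and the pruning bound are all assumptions made for illustration. Each object of R goes to its nearest pivot. An object s of S is replicated to a group only if its distance to that group's pivot is at most 2U + d_k, where U is the largest distance from the pivot to any R object in the group and d_k is the distance from the pivot to its k-th nearest object in S. By the triangle inequality, any s outside that bound cannot be among the k nearest neighbors of any r in the group.

```python
import heapq
import math
import random
from collections import defaultdict


def brute_force_knn_join(R, S, k):
    """Baseline: for every r in R, the k nearest objects of S as (distance, index)."""
    return {
        i: heapq.nsmallest(k, ((math.dist(r, s), j) for j, s in enumerate(S)))
        for i, r in enumerate(R)
    }


def map_phase(R, S, pivots, k):
    """Hypothetical mapper: group R by nearest pivot, replicate S only where needed."""
    groups = defaultdict(lambda: ([], []))  # pivot id -> (R part, S part)
    radius = defaultdict(float)             # pivot id -> max distance to its R objects
    for i, r in enumerate(R):
        g = min(range(len(pivots)), key=lambda p: math.dist(r, pivots[p]))
        groups[g][0].append((i, r))
        radius[g] = max(radius[g], math.dist(r, pivots[g]))
    for g in list(groups):
        d_to_pivot = [math.dist(s, pivots[g]) for s in S]
        kth = heapq.nsmallest(k, d_to_pivot)[-1]  # k-th smallest pivot-to-S distance
        bound = 2 * radius[g] + kth               # assumed triangle-inequality bound
        for j, s in enumerate(S):
            if d_to_pivot[j] <= bound:            # otherwise s is pruned for this group
                groups[g][1].append((j, s))
    return groups


def reduce_phase(groups, k):
    """Hypothetical reducer: an independent kNN join inside each group."""
    result = {}
    for r_part, s_part in groups.values():
        for i, r in r_part:
            result[i] = heapq.nsmallest(
                k, ((math.dist(r, s), j) for j, s in s_part)
            )
    return result


if __name__ == "__main__":
    random.seed(0)
    R = [(random.random(), random.random()) for _ in range(60)]
    S = [(random.random(), random.random()) for _ in range(200)]
    pivots = random.sample(R, 4)
    groups = map_phase(R, S, pivots, 3)
    replicas = sum(len(s_part) for _, s_part in groups.values())
    print("S replicas shuffled:", replicas, "of", len(pivots) * len(S), "without pruning")
    print("matches brute force:", reduce_phase(groups, 3) == brute_force_knn_join(R, S, 3))
```

The number of S replicas sent to the groups stands in for the shuffling cost, which is why tighter pruning bounds and better grouping matter in a real cluster setting.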