Efficient parallel kNN joins for large data in MapReduce

Authors:
Chi Zhang;Feifei Li;Jeffrey Jestes
Affiliations:
Florida State University;University of Utah;University of Utah
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

References (22):
A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
Clone join and shadow join: two parallel spatial join algorithms. Proceedings of the 8th ACM international symposium on Advances in geographic information systems
Data Partitioning for Parallel Spatial Join Processing. Geoinformatica
Parallel Processing of Spatial Joins Using R-trees. ICDE '96 Proceedings of the Twelfth International Conference on Data Engineering
Bucket Spreading Parallel Hash: A New, Robust, Parallel Hash Join Method for Data Skew in the Super Database Computer (SDC). VLDB '90 Proceedings of the 16th International Conference on Very Large Data Bases
Parallel R-Tree Spatial Join for a Shared-Nothing Architecture. DANTE '99 Proceedings of the 1999 International Symposium on Database Applications in Non-Traditional Environments
The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Knowledge and Information Systems
Distributed computation of the knn graph for large high-dimensional point sets. Journal of Parallel and Distributed Computing
Efficient index-based KNN join processing for high-dimensional data. Information and Software Technology
MapReduce: simplified data processing on large clusters. OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Data-Parallel Spatial Join Algorithms. ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 03
Gorder: an efficient method for KNN join processing. VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Spatial Queries Evaluation with MapReduce. GCC '09 Proceedings of the 2009 Eighth International Conference on Grid and Cooperative Computing
High-dimensional kNN joins with incremental updates. Geoinformatica
Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Indexing multi-dimensional data in a cloud system. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient B-tree based indexing for cloud data processing. Proceedings of the VLDB Endowment
Voronoi-Based Geospatial Query Processing with MapReduce. CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
RanKloud: a scalable ranked query processing framework on hadoop. Proceedings of the 14th International Conference on Extending Database Technology
Processing theta-joins using MapReduce. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Parallel construction of k-nearest neighbor graphs for point clouds. SPBG'08 Proceedings of the Fifth Eurographics / IEEE VGTC conference on Point-Based Graphics

Cited by (9):
Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the VLDB Endowment
CudaGIS: report on the design and realization of a massive data parallel GIS on GPUs. Proceedings of the Third ACM SIGSPATIAL International Workshop on GeoStreaming
Speeding up large-scale point-in-polygon test based spatial join on GPUs. Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data
Nearest group queries. Proceedings of the 25th International Conference on Scientific and Statistical Database Management
Distributed data management using MapReduce. ACM Computing Surveys (CSUR)
CG_Hadoop: computational geometry in MapReduce. Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data. Proceedings of the VLDB Endowment
DIMO: distributed index for matching multimedia objects using MapReduce. Proceedings of the 5th ACM Multimedia Systems Conference
ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms. Data & Knowledge Engineering

Abstract

In data mining applications and in spatial and multimedia databases, a useful tool is the kNN join, which produces, for every point in a dataset R, its k nearest neighbors (NN) from a dataset S. Because it combines a join with NN search, performing kNN joins efficiently is challenging. Meanwhile, applications continue to witness a rapid (in some cases exponential) increase in the amount of data to be processed. A popular model for large-scale data processing is the shared-nothing cluster of commodity machines running MapReduce [6]. Hence, executing kNN joins efficiently on large data stored in a MapReduce cluster is an intriguing problem that meets many practical needs. This work proposes novel (exact and approximate) algorithms in MapReduce to perform efficient parallel kNN joins on large data. We demonstrate our ideas using Hadoop. Extensive experiments on large real and synthetic datasets, with tens or hundreds of millions of records in both R and S and up to 30 dimensions, demonstrate the efficiency, effectiveness, and scalability of our methods.
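
To make the kNN join definition concrete, the following is a minimal single-machine sketch in Python. It is not the algorithms proposed in the paper, which the abstract does not detail. The first function is a brute-force reading of the definition; the second simulates a generic map/reduce-style evaluation in which each pair of blocks yields local candidates that are then merged per point of R. The names `knn_join`, `split`, and `knn_join_blocked`, the Euclidean distance, and the round-robin block scheme are all assumptions made for illustration.

```python
import heapq
import math
from itertools import product


def knn_join(R, S, k):
    """Brute-force kNN join: for each r in R, the k points of S closest to r.

    R and S map point ids to coordinate tuples (an assumed representation).
    """
    result = {}
    for rid, r in R.items():
        # Keep the k smallest (distance, sid) pairs; ties break on sid.
        result[rid] = heapq.nsmallest(
            k, ((math.dist(r, s), sid) for sid, s in S.items())
        )
    return result


def split(points, n):
    """Hypothetical helper: split an id -> point dict into n round-robin blocks."""
    items = sorted(points.items())
    return [dict(items[i::n]) for i in range(n)]


def knn_join_blocked(R, S, k, n=2):
    """Simulate a map/reduce-style evaluation on one machine."""
    # "Map" phase: every (R block, S block) pair yields local candidates.
    partial = {}
    for r_block, s_block in product(split(R, n), split(S, n)):
        for rid, candidates in knn_join(r_block, s_block, k).items():
            partial.setdefault(rid, []).extend(candidates)
    # "Reduce" phase: merge the local candidates of each r and keep the best k.
    return {rid: heapq.nsmallest(k, cands) for rid, cands in partial.items()}


if __name__ == "__main__":
    R = {"r1": (0.0, 0.0), "r2": (5.0, 5.0)}
    S = {"s1": (1.0, 0.0), "s2": (0.0, 2.0), "s3": (4.0, 5.0), "s4": (9.0, 9.0)}
    print(knn_join(R, S, k=2))
    print(knn_join_blocked(R, S, k=2))
```

The blocked variant returns the same answer as the brute-force one because any global top-k neighbor of a point is also among the top k within its own block of S. Its cost still grows with the product of the two dataset sizes, which is one way to see why efficient kNN joins on large data are hard.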