Solving k-Nearest Neighbor Problem on Multiple Graphics Processors

Authors:
Kimikazu Kato;Tikara Hosino
Affiliations:
-;-
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010

Citing 4

Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing

Latent Dirichlet allocation
The Journal of Machine Learning Research

Google news personalization: scalable online collaborative filtering
Proceedings of the 16th international conference on World Wide Web

A Practical Quicksort Algorithm for Graphics Processors
ESA '08 Proceedings of the 16th annual European symposium on Algorithms

Cited 2

A social network-aware top-N recommender system using GPU
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries

Improving the speed and stability of the k-nearest neighbors method
Pattern Recognition Letters

Abstract

A recommendation system is a software system that predicts customers' unknown preferences from their known preferences. In a recommendation system, customers' preferences are encoded as vectors, and finding the nearest vectors to each vector is an essential step. This vector-search step is called the $k$-nearest neighbor problem. We give an effective algorithm for solving this problem on multiple graphics processing units (GPUs). Our algorithm consists of two parts: an $N$-body problem and a partial sort. For the $N$-body part, we apply the idea of a known algorithm for the $N$-body problem in physics, although an additional trick is needed to overcome the small size of shared memory. For the partial sort, we give a novel GPU algorithm that is effective for small $k$. In our partial sort algorithm, a heap is accessed in parallel by threads with a low synchronization cost. Both parts of our algorithm make full use of coalesced memory access, so that full bandwidth is achieved. We show experimentally that, when the problem size is large, an implementation of the algorithm on two GPUs runs more than 330 times faster than a single-core implementation on a recent CPU. We also show that our algorithm scales well with respect to the number of GPUs.
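
The two-part structure described in the abstract (all-pairs distance computation followed by a partial sort that keeps only the $k$ smallest distances) can be illustrated with a minimal sequential sketch. This is not the authors' GPU algorithm: it runs on a single CPU thread, assumes Euclidean distance (the abstract does not name the metric), and uses Python's `heapq` as a stand-in for the parallel heap. The function name `knn_bruteforce` is hypothetical.

```python
import heapq
import math


def knn_bruteforce(vectors, k):
    """Return, for each vector, the indices of its k nearest other vectors.

    Assumption: Euclidean distance. Results are ordered nearest first.
    """
    n = len(vectors)
    result = []
    for i in range(n):
        # Bounded max-heap of size k, emulated by storing negated distances
        # in heapq's min-heap. heap[0] is always the farthest kept neighbor.
        heap = []
        for j in range(n):
            if i == j:
                continue
            # "N-body" part: every pair (i, j) is visited once per query i.
            d = math.dist(vectors[i], vectors[j])
            if len(heap) < k:
                heapq.heappush(heap, (-d, j))
            elif -d > heap[0][0]:
                # Partial sort part: replace the current farthest neighbor
                # only when a closer candidate is found.
                heapq.heapreplace(heap, (-d, j))
        # Sorting by negated distance in descending order yields
        # ascending true distance.
        result.append([j for _, j in sorted(heap, reverse=True)])
    return result


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.5)]
    print(knn_bruteforce(points, k=2))
```

The sketch performs $O(N^2)$ distance evaluations and $O(\log k)$ work per heap update, which is why keeping the heap small pays off when $k$ is small relative to $N$.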