GPU enhanced parallel computing for large scale data clustering

Authors:
Xiaohui Cui;Jesse St. Charles;Thomas Potok
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States and New York Institute of Technology, Manhattan, NY 10023, United States;Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, United States;Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States
Venue:
Future Generation Computer Systems
Year:
2013



Abstract

Analyzing and clustering large-scale data sets is a complex problem. One explored approach borrows from nature, imitating the flocking behavior of birds. A limitation of this clustering method is its O(n^2) computational complexity: as the number of data points and feature dimensions grows, it becomes increasingly difficult to generate results in a reasonable amount of time. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly parallel and semi-parallel problems much faster than a traditional sequential processor. In this paper, we exploit this architecture and apply its strengths to the flocking-based, high-dimensional data clustering problem. Using the CUDA platform from NVIDIA, we developed a Multiple Species Data Flocking implementation that runs on NVIDIA GPUs. The GPU implementation achieved a 30- to 60-fold speedup over the implementation on a 3 GHz CPU.
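
To make the O(n^2) cost mentioned in the abstract concrete, the following is a minimal sketch of a naive all-pairs neighbor search, the kind of step a flocking simulation typically repeats on every iteration. It is an illustration only, not the authors' CUDA implementation: the function name `find_neighbors`, the 2D positions, and the `radius` parameter are all assumptions introduced here.

```python
import math


def find_neighbors(positions, radius):
    """Naive all-pairs search: each agent inspects every other agent.

    Hypothetical helper for illustration; returns the neighbor lists and
    the number of pair checks performed, which is n * (n - 1).
    """
    n = len(positions)
    checks = 0
    neighbors = [[] for _ in range(n)]
    for i in range(n):          # one independent unit of work per agent
        for j in range(n):      # compared against every other agent
            if i == j:
                continue
            checks += 1
            if math.dist(positions[i], positions[j]) <= radius:
                neighbors[i].append(j)
    return neighbors, checks


# Four agents: three close together, one far away (made-up coordinates).
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
neighbors, checks = find_neighbors(positions, radius=1.5)
print(neighbors)
print(checks)
```

The number of pair checks grows quadratically with the number of agents. The iterations of the outer loop do not depend on one another, which is the sort of structure that can, in principle, be spread across many GPU threads.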