We present a study of OpenCL implementations of three important kernels that occur frequently in iterative statistical applications: multi-dimensional scaling (MDS), PageRank and K-means clustering. We evaluated their performance on NVIDIA Tesla and Fermi GPGPU cards using dedicated hardware and, in the case of Fermi, also on the Amazon EC2 cloud-computing environment. We explored the optimisation of these kernels using four main techniques: (1) caching invariant data in GPU memory across iterations; (2) selectively placing data in different memory levels; (3) rearranging data in memory; and (4) dividing the work between the GPU and the CPU. We also implemented a novel algorithm for MDS and a novel data layout scheme for PageRank. Our optimisations yielded speedups of up to 5× to 6× over naïve OpenCL implementations and up to 100× over a single-core CPU. We believe that these categories of optimisations are also applicable to other similar kernels.
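To make the first technique concrete, the following is a minimal CPU-only sketch in Python of a PageRank iteration that separates invariant data from data that changes per iteration. It is not the OpenCL implementation evaluated here, and it does not reproduce the novel data layout scheme. The function names `build_csr` and `pagerank`, the compressed sparse row (CSR) layout, the damping factor of 0.85 and the fixed iteration count are all assumptions made for illustration. In a GPU setting, the arrays built once by `build_csr` are the kind of data one would keep resident in device memory, while only the rank vector is updated on each iteration.

```python
def build_csr(n, edges):
    """Build a CSR-style in-link structure once (invariant across iterations).

    Hypothetical helper: `edges` is a list of (source, destination) pairs.
    """
    out_degree = [0] * n
    in_lists = [[] for _ in range(n)]
    for src, dst in edges:
        out_degree[src] += 1
        in_lists[dst].append(src)

    # Flatten the per-page in-link lists into two contiguous arrays.
    row_ptr, col_idx = [0], []
    for sources in in_lists:
        col_idx.extend(sources)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, out_degree


def pagerank(n, edges, damping=0.85, iterations=50):
    """Power iteration for PageRank (illustrative parameters, not tuned)."""
    # Invariant data: built once, reused unchanged by every iteration.
    row_ptr, col_idx, out_degree = build_csr(n, edges)

    # Variable data: the only array rewritten on each iteration.
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if out_degree[i] == 0)
        new_rank = []
        for page in range(n):
            sources = col_idx[row_ptr[page]:row_ptr[page + 1]]
            incoming = sum(rank[src] / out_degree[src] for src in sources)
            new_rank.append((1.0 - damping) / n
                            + damping * (incoming + dangling / n))
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Pages 0 and 1 both link to page 2, which links back to page 0.
    print(pagerank(3, [(0, 2), (1, 2), (2, 0)]))
```

The same split applies to the other two kernels in principle: the input points in K-means and the dissimilarity data in MDS stay fixed, while centroids and coordinates change between iterations.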