We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor. We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.