Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark

Authors:
S. J. Pennycook;S. D. Hammond;S. A. Jarvis;G. R. Mudalige
Affiliations:
University of Warwick, UK;University of Warwick, UK;University of Warwick, UK;University of Oxford, UK
Venue:
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Year:
2011

Abstract

We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor. We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.