C-DAC's efforts: application kernels on HPC cluster with GPU accelerators

Authors:
Vcv. Rao;Nisha Agrawal;Samrit Maity
Affiliations:
C-DAC, Pune University Campus, Pune, Maharashtra, India;C-DAC, Pune University Campus, Pune, Maharashtra, India;C-DAC, Pune University Campus, Pune, Maharashtra, India
Venue:
Proceedings of the ATIP/A*CRC Workshop on Accelerator Technologies for High-Performance Computing: Does Asia Lead the Way?
Year:
2012

Citing 8
Cited 0

MPI-The Complete Reference, Volume 1: The MPI Core
Parallelization of Finite Volume Computations for Heat Transfer Application Using Unstructured Mesh Partitioning Algorithms (HIPC '97 Proceedings of the Fourth International Conference on High-Performance Computing)
GPU Cluster for High Performance Computing (Proceedings of the 2004 ACM/IEEE conference on Supercomputing)
Exploring weak scalability for FEM calculations on a GPU-enhanced cluster (Parallel Computing)
Implementing sparse matrix-vector multiplication on throughput-oriented processors (Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis)
Programming Massively Parallel Processors: A Hands-on Approach
CUDA by Example: An Introduction to General-Purpose GPU Programming
Heterogeneous Computing with OpenCL

Abstract

We describe the parallelization of finite difference method (FDM) and finite element method (FEM) computations for a certain class of partial differential equations (PDEs) on a High Performance Computing (HPC) GPU cluster. For FDM, structured grids are employed and optimal data rearrangement operations are performed in the GPU computations. For FEM, unstructured triangular and hexahedral meshes are generated, and the graph partitioning software METIS [14] is used to generate load-balanced sub-domains. Iterative methods are used to solve the resulting system of linear algebraic equations. Our experiments use a combination of MPI with CUDA and OpenCL on the NVIDIA GPUs, as well as MPI with OpenCL on the AMD-ATI GPUs, of the HPC GPU cluster [4,6,7,8]. The experiments indicate that the MPI-CUDA codes based on FDM and FEM achieve a speed-up of nearly 6x for large mesh sizes in comparison to the host-CPU implementation of the same code. The unoptimized OpenCL implementation showed only a marginal improvement in speed-up, whereas the counterpart CUDA codes achieved a maximum speed-up of 4x to 6x on the HPC GPU cluster. We present a performance analysis for different mesh sizes that demonstrates the performance and scalability of FDM and FEM computations on the GPU cluster.
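The abstract does not name the PDE, the iterative method, or the decomposition, so the following is only a minimal sketch under stated assumptions: a 2D Laplace equation on a structured grid, a Jacobi iteration with the 5-point stencil, and a split into two row slabs that trade one halo row after every sweep. The halo copy stands in for what an MPI exchange between two ranks would do; all function names are hypothetical and this is not the authors' code.

```python
# Illustrative sketch only: the PDE (2D Laplace), the Jacobi method and the
# two-slab split are assumptions, not details taken from the paper.

def jacobi_sweep(u):
    """One Jacobi sweep of the 5-point stencil; first/last rows and columns are kept."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def make_grid(n):
    """n x n structured grid: top boundary held at 1.0, everything else 0.0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n
    return grid


def solve_single(n, sweeps):
    """Reference solution on the whole grid (the 'host-only' analogue)."""
    u = make_grid(n)
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u


def solve_two_slabs(n, sweeps):
    """Same iteration on two row slabs with a one-row halo exchange per sweep."""
    u = make_grid(n)
    mid = n // 2
    top = [row[:] for row in u[:mid + 1]]     # global rows 0..mid, last row is halo
    bottom = [row[:] for row in u[mid - 1:]]  # global rows mid-1..n-1, first row is halo
    for _ in range(sweeps):
        top = jacobi_sweep(top)
        bottom = jacobi_sweep(bottom)
        # Halo exchange: in an MPI code this would be a send/receive pair.
        top[-1] = bottom[1][:]
        bottom[0] = top[-2][:]
    # Drop the halo rows and stitch the owned rows back together.
    return top[:-1] + bottom[1:]


if __name__ == "__main__":
    single = solve_single(8, 50)
    split = solve_two_slabs(8, 50)
    print("decomposed result matches:", single == split)
```

Because Jacobi reads only values from the previous sweep, the decomposed run performs the same arithmetic as the single-grid run, which is why a halo exchange once per sweep is sufficient in this sketch.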