This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogeneous clusters, where CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts calls to DGEMM and DTRSM and executes them simultaneously on both the GPUs and the CPU cores. Using this CUDA-accelerated version of HPL, an 8U cluster is able to sustain more than a teraflop.
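To illustrate the interception scheme, below is a minimal sketch of such a host library in CUDA C. It shadows the Fortran BLAS symbol `dgemm_` so it can be injected under HPL with `LD_PRELOAD`, resolves the original CPU BLAS via `dlsym(RTLD_NEXT, ...)`, and splits each call column-wise between cuBLAS and the CPU BLAS. The fixed split ratio, the size threshold, and the restriction to the non-transposed case are illustrative assumptions; the library described in the abstract is more general (it also intercepts DTRSM) and manages data movement and load balancing far more carefully.

```c
/* gemm_intercept.c -- a sketch of DGEMM interception, NOT the paper's code.
 * Build: nvcc -Xcompiler -fPIC -shared gemm_intercept.c -o libigemm.so -lcublas -ldl
 * Run:   LD_PRELOAD=./libigemm.so ./xhpl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

typedef void (*dgemm_t)(const char*, const char*, const int*, const int*,
                        const int*, const double*, const double*, const int*,
                        const double*, const int*, const double*,
                        double*, const int*);

/* Shadow the Fortran BLAS entry point that HPL calls. */
void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
            const int *k, const double *alpha,
            const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    static dgemm_t cpu_dgemm = NULL;
    if (!cpu_dgemm)  /* resolve the real CPU BLAS symbol we are shadowing */
        cpu_dgemm = (dgemm_t)dlsym(RTLD_NEXT, "dgemm_");

    /* Fall back to the CPU for transposed or small problems (assumed policy). */
    if ((*ta != 'N' && *ta != 'n') || (*tb != 'N' && *tb != 'n') || *n < 512) {
        cpu_dgemm(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }

    const double split = 0.75;           /* assumed fixed GPU share of columns */
    int n_gpu = (int)(*n * split);
    int n_cpu = *n - n_gpu;

    static cublasHandle_t h = NULL;
    if (!h) cublasCreate(&h);

    /* Copy A, the first n_gpu columns of B, and of C to the device
       (error checks omitted for brevity; all matrices are column-major). */
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * (size_t)(*lda) * (*k));
    cudaMalloc(&dB, sizeof(double) * (size_t)(*ldb) * n_gpu);
    cudaMalloc(&dC, sizeof(double) * (size_t)(*ldc) * n_gpu);
    cudaMemcpy(dA, A, sizeof(double) * (size_t)(*lda) * (*k), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * (size_t)(*ldb) * n_gpu, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * (size_t)(*ldc) * n_gpu, cudaMemcpyHostToDevice);

    /* GPU computes the left block of C; cublasDgemm returns immediately,
       so the CPU BLAS call below overlaps with the GPU work. */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, *m, n_gpu, *k,
                alpha, dA, *lda, dB, *ldb, beta, dC, *ldc);

    /* CPU cores compute the remaining n_cpu columns of C in parallel. */
    cpu_dgemm(ta, tb, m, &n_cpu, k, alpha, A, lda,
              B + (size_t)(*ldb) * n_gpu, ldb, beta,
              C + (size_t)(*ldc) * n_gpu, ldc);

    /* This copy synchronizes with the GPU and merges the result back. */
    cudaMemcpy(C, dC, sizeof(double) * (size_t)(*ldc) * n_gpu, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

Because the split is by columns of C, the two partial products touch disjoint parts of the output and can proceed concurrently; in practice the split ratio would be tuned to the relative DGEMM throughput of the GPU and the CPU cores, and the per-call allocation and copies shown here would be replaced by persistent buffers and streamed transfers.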