This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogeneous clusters, where CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts calls to DGEMM and DTRSM and executes them simultaneously on both the GPUs and the CPU cores. Using this CUDA-accelerated version of HPL, an 8U cluster is able to sustain more than a teraflop.
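To illustrate the interception scheme, below is a minimal sketch of such a host library in CUDA C. It shadows the Fortran BLAS symbol `dgemm_` so it can be injected under HPL with `LD_PRELOAD`, resolves the original CPU BLAS via `dlsym(RTLD_NEXT, ...)`, and splits each call column-wise between cuBLAS and the CPU BLAS. The fixed split ratio, the size threshold, and the restriction to the non-transposed case are illustrative assumptions; the library described in the abstract is more general (it also intercepts DTRSM) and manages data movement and load balancing far more carefully.

```c
/* gemm_intercept.c -- a sketch of DGEMM interception, NOT the paper's code.
 * Build: nvcc -Xcompiler -fPIC -shared gemm_intercept.c -o libigemm.so -lcublas -ldl
 * Run:   LD_PRELOAD=./libigemm.so ./xhpl
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

typedef void (*dgemm_t)(const char*, const char*, const int*, const int*,
                        const int*, const double*, const double*, const int*,
                        const double*, const int*, const double*,
                        double*, const int*);

/* Shadow the Fortran BLAS entry point that HPL calls. */
void dgemm_(const char *ta, const char *tb, const int *m, const int *n,
            const int *k, const double *alpha,
            const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    static dgemm_t cpu_dgemm = NULL;
    if (!cpu_dgemm)  /* resolve the real CPU BLAS symbol we are shadowing */
        cpu_dgemm = (dgemm_t)dlsym(RTLD_NEXT, "dgemm_");

    /* Fall back to the CPU for transposed or small problems (assumed policy). */
    if ((*ta != 'N' && *ta != 'n') || (*tb != 'N' && *tb != 'n') || *n < 512) {
        cpu_dgemm(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }

    const double split = 0.75;           /* assumed fixed GPU share of columns */
    int n_gpu = (int)(*n * split);
    int n_cpu = *n - n_gpu;

    static cublasHandle_t h = NULL;
    if (!h) cublasCreate(&h);

    /* Copy A, the first n_gpu columns of B, and of C to the device
       (error checks omitted for brevity; all matrices are column-major). */
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * (size_t)(*lda) * (*k));
    cudaMalloc(&dB, sizeof(double) * (size_t)(*ldb) * n_gpu);
    cudaMalloc(&dC, sizeof(double) * (size_t)(*ldc) * n_gpu);
    cudaMemcpy(dA, A, sizeof(double) * (size_t)(*lda) * (*k), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * (size_t)(*ldb) * n_gpu, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * (size_t)(*ldc) * n_gpu, cudaMemcpyHostToDevice);

    /* GPU computes the left block of C; cublasDgemm returns immediately,
       so the CPU BLAS call below overlaps with the GPU work. */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, *m, n_gpu, *k,
                alpha, dA, *lda, dB, *ldb, beta, dC, *ldc);

    /* CPU cores compute the remaining n_cpu columns of C in parallel. */
    cpu_dgemm(ta, tb, m, &n_cpu, k, alpha, A, lda,
              B + (size_t)(*ldb) * n_gpu, ldb, beta,
              C + (size_t)(*ldc) * n_gpu, ldc);

    /* This copy synchronizes with the GPU and merges the result back. */
    cudaMemcpy(C, dC, sizeof(double) * (size_t)(*ldc) * n_gpu, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

Because the split is by columns of C, the two partial products touch disjoint parts of the output and can proceed concurrently; in practice the split ratio would be tuned to the relative DGEMM throughput of the GPU and the CPU cores, and the per-call allocation and copies shown here would be replaced by persistent buffers and streamed transfers.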