A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
LAPACK: a portable linear algebra library for high-performance computers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
Proceedings of the 23rd international conference on Supercomputing
Single-particle 3d reconstruction from cryo-electron microscopy images on GPU
Proceedings of the 23rd international conference on Supercomputing
A Note on Auto-tuning GEMM for GPUs
ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
An Improved Magma Gemm For Fermi Graphics Processing Units
International Journal of High Performance Computing Applications
Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
CLUSTER '10 Proceedings of the 2010 IEEE International Conference on Cluster Computing
The International Exascale Software Project roadmap
International Journal of High Performance Computing Applications
A fast GEMM implementation on the cypress GPU
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Automatic CPU-GPU communication management and optimization
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Hi-index | 0.00 |
In heterogeneous systems that include CPUs and GPUs, the data transfers between these components play a critical role in determining the performance of applications. Software pipelining is a common approach to mitigate the overheads of those transfers. In this paper we investigate advanced software-pipelining optimizations for the double-precision general matrix multiplication (DGEMM) algorithm running on a heterogeneous system that includes ATI GPUs. Our approach decomposes the DGEMM workload to a finer detail and hides the latency of CPU-GPU data transfers to a higher degree than previous approaches in literature. We implement our approach in a five-stage software pipelined DGEMM and analyze its performance on a platform including x86 multi-core CPUs and an ATI Radeon™ HD5970 GPU that has two Cypress GPU chips on board. Our implementation delivers 758 GFLOPS (82% floating-point efficiency) when it uses only the GPU, and 844 GFLOPS (80% efficiency) when it distributes the workload on both CPU and GPU. We analyze the performance of our optimized DGEMM as the number of GPU chips employed grows from one to two, and the results show that resource contention on the PCIe bus and on the host memory are limiting factors.