An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Authors:
Jiajia Li;Xingjian Li;Guangming Tan;Mingyu Chen;Ninghui Sun
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012


Abstract

In heterogeneous systems that combine CPUs and GPUs, data transfers between the two play a critical role in application performance. Software pipelining is a common way to mitigate the overhead of those transfers. In this paper we investigate advanced software-pipelining optimizations for double-precision general matrix multiplication (DGEMM) on a heterogeneous system that includes ATI GPUs. Our approach decomposes the DGEMM workload at a finer granularity and hides more of the CPU-GPU data-transfer latency than previous approaches in the literature. We implement it as a five-stage software-pipelined DGEMM and analyze its performance on a platform with x86 multi-core CPUs and an ATI Radeon™ HD5970 GPU, which has two Cypress GPU chips on board. Our implementation delivers 758 GFLOPS (82% floating-point efficiency) when it uses only the GPU, and 844 GFLOPS (80% efficiency) when it distributes the workload across both the CPU and the GPU. We also analyze the performance of our optimized DGEMM as the number of GPU chips employed grows from one to two; the results show that resource contention on the PCIe bus and in host memory limits performance.
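
To illustrate why software pipelining hides transfer latency, the following is a minimal timing sketch in Python. It is not the paper's implementation: the function names, the five placeholder stage durations, and the block count are all assumptions chosen for illustration, and the model idealizes the pipeline by assuming unlimited buffering between stages and no contention.

```python
def sequential_time(stage_times, n_blocks):
    """Total time when each block finishes every stage before the next block starts."""
    return n_blocks * sum(stage_times)


def pipelined_time(stage_times, n_blocks):
    """Total time for an idealized software pipeline.

    Each stage works on one block at a time. A block enters a stage once it
    has left the previous stage and the stage itself is free.
    """
    stage_free_at = [0.0] * len(stage_times)  # when each stage last became free
    for _ in range(n_blocks):
        ready = 0.0  # when the current block leaves the previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


# Hypothetical stage durations (arbitrary time units), not measured values:
# two input-side transfer stages, one compute stage, two output-side stages.
HYPOTHETICAL_STAGES = [1, 1, 4, 1, 1]
HYPOTHETICAL_BLOCKS = 10
```

With these assumed numbers, `sequential_time(HYPOTHETICAL_STAGES, HYPOTHETICAL_BLOCKS)` gives 80 time units, while `pipelined_time(HYPOTHETICAL_STAGES, HYPOTHETICAL_BLOCKS)` gives 44: once the pipeline is full, one block completes every 4 units, the duration of the slowest stage, so the transfer stages are overlapped with computation.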