Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

Authors:
Canqun Yang; Feng Wang; Yunfei Du; Juan Chen; Jie Liu; Huizhan Yi; Kai Lu
Venue:
CLUSTER '10 Proceedings of the 2010 IEEE International Conference on Cluster Computing
Year:
2010


Abstract

In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer and the largest GPU-accelerated system attempted at the time. We present an adaptive optimization framework that balances the workload across GPUs and CPUs with negligible runtime overhead, yielding better performance than static or training-based partitioning methods. A software pipelining technique effectively hides the CPU-GPU communication overhead, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation tuned with the adaptive optimization framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor’s library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
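
The abstract does not spell out how the adaptive framework chooses the CPU/GPU split, so the following is only a minimal sketch of the general idea of adaptive partitioning, not the authors' method. It assumes a simple rule: after each step, measure how fast each device processed its share and give the GPU a fraction of the next step proportional to its measured throughput. The names `update_split` and `simulate`, and the fixed device speeds used to stand in for real timings, are hypothetical.

```python
def update_split(gpu_work, cpu_work, gpu_time, cpu_time):
    """Return the GPU share for the next step from measured throughputs.

    Assumes both devices received a nonzero amount of work and took a
    nonzero amount of time in the step just completed.
    """
    gpu_rate = gpu_work / gpu_time  # work units per second on the GPU
    cpu_rate = cpu_work / cpu_time  # work units per second on the CPU
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(total_work, gpu_speed, cpu_speed, steps, initial_split=0.5):
    """Simulate repeated steps with constant (hypothetical) device speeds.

    Returns a list of (gpu_share, step_time) pairs, where step_time is the
    time of the slower device, since both must finish before the next step.
    initial_split must lie strictly between 0 and 1.
    """
    split = initial_split
    history = []
    for _ in range(steps):
        gpu_work = split * total_work
        cpu_work = total_work - gpu_work
        # Stand-in for real timing measurements of each device.
        gpu_time = gpu_work / gpu_speed
        cpu_time = cpu_work / cpu_speed
        history.append((split, max(gpu_time, cpu_time)))
        split = update_split(gpu_work, cpu_work, gpu_time, cpu_time)
    return history


if __name__ == "__main__":
    for share, step_time in simulate(1000.0, gpu_speed=4.0, cpu_speed=1.0, steps=3):
        print(f"GPU share {share:.2f}, step time {step_time:.1f}")
```

In this toy model the split converges after a single measurement because the device speeds are constant; on real hardware the measured rates vary with problem size and load, which is the situation an adaptive scheme is meant to track. The only runtime cost of the rule is a few arithmetic operations per step.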