Optimizing linpack benchmark on GPU-accelerated petascale supercomputer

Authors:
Feng Wang;Can-Qun Yang;Yun-Fei Du;Juan Chen;Hui-Zhan Yi;Wei-Xia Xu
Affiliations:
School of Computer Science, National University of Defense Technology, Changsha, China;School of Computer Science, National University of Defense Technology, Changsha, China;School of Computer Science, National University of Defense Technology, Changsha, China;School of Computer Science, National University of Defense Technology, Changsha, China;School of Computer Science, National University of Defense Technology, Changsha, China;School of Computer Science, National University of Defense Technology, Changsha, China
Venue:
Journal of Computer Science and Technology - Special issue on Community Analysis and Information Recommendation
Year:
2011

Abstract

In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer in China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP, and streaming computing is described to exploit the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
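
The abstract does not spell out the two-level adaptive method, so the following is only a simplified, single-level sketch of the general idea behind adaptive CPU/GPU load distribution: measure how fast each side processed its share in the previous step, then split the next step's work in proportion to those rates so that both sides finish at about the same time. The function names, the simulation loop, and the throughput figures are all hypothetical and are not taken from the paper.

```python
def update_gpu_fraction(gpu_work, gpu_time, cpu_work, cpu_time):
    """Return the GPU share of work that would equalize finish times.

    Rates are estimated from the previous step's measurements
    (work units per second). Hypothetical helper, not the paper's method.
    """
    gpu_rate = gpu_work / gpu_time
    cpu_rate = cpu_work / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(total_work, true_gpu_rate, true_cpu_rate, steps=5,
             initial_fraction=0.5):
    """Simulate repeated steps with assumed constant device rates.

    Returns a list of (gpu_fraction, step_time) pairs, where step_time
    is the time of the slower side, since both must finish before the
    next step can begin.
    """
    fraction = initial_fraction
    history = []
    for _ in range(steps):
        gpu_work = total_work * fraction
        cpu_work = total_work - gpu_work
        # In a real run these would be timer measurements.
        gpu_time = gpu_work / true_gpu_rate
        cpu_time = cpu_work / true_cpu_rate
        history.append((fraction, max(gpu_time, cpu_time)))
        fraction = update_gpu_fraction(gpu_work, gpu_time,
                                       cpu_work, cpu_time)
    return history


if __name__ == "__main__":
    # Hypothetical rates: the GPU is six times faster than the CPU.
    for fraction, step_time in simulate(100.0, 240.0, 40.0, steps=3):
        print(f"gpu_fraction={fraction:.3f} step_time={step_time:.3f}")
```

With constant rates the split settles after a single update. On real hardware the measured rates vary with the problem size, which is one plausible reason to keep adapting the split throughout the run.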