Optimizing stencil application on multi-thread GPU architecture using stream programming model

Authors:
Fang Xudong;Tang Yuhua;Wang Guibin;Tang Tao;Zhang Ying
Affiliations:
National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China;National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China;National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China;National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China;National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China
Venue:
ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Year:
2010

Citing 9
Cited 1

Conversion of control dependence to data dependence

POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Simulation of cloud dynamics on graphics hardware

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Automatic tiling of iterative stencil loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
GPU Cluster for High Performance Computing

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Identifying and Exploiting Spatial Regularity in Data Memory References

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization carving for GPU computing

Journal of Parallel and Distributed Computing
Architecture-aware optimization targeting multithreaded stream computing

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units

Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE

The Journal of Supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

With fast development of GPU hardware and software, using GPUs to accelerate non-graphics CPU applications is becoming inevitable trend. GPUs are good at performing ALU-intensive computation and feature high peak performance; however, how to harness GPUs' powerful computing capacity to accelerate the applications in the field of scientific computing still remains a big challenge. In this paper, we implement the whole application Mgrid taken from Spec2000 benchmarks on an AMD GPU and propose several optimization strategies for stencil computations in the naive GPU code. We first improve thread utilization through using vector types and multiple output streams mechanism provided by the Brook+ programming language. By tuning thread granularity, we try to hit the right balance between locality within each thread and parallelism among threads. Then, we reorganize the stream layout by transforming the 3D data stream into the 2D stream in the block manner. Through stream reorganization, more data locality in the cache is exploited. Further, we propose branch elimination to convert control dependence to data dependence, catering to GPUs' powerful ALU-intensive processing capability. Finally, we redistribute computations between CPU and GPU to make more advisable computing resources usage considering different problem sizes. We demonstrate the effectiveness of our proposed optimization strategies on an AMD Radeon HD4870 GPU using the Brook+ programming language. Using a double-precision floating-point implementation, the experimental results show that the optimized GPU version of Mgrid gains 2.38x speedup compared to the naive GPU code and obtains as high as 15.06x speedup versus the CPU implementation run on an Intel Xeon E5405 CPU.