Parallelizing SOR for GPGPUs using alternate loop tiling

Authors:
Peng Di;Hui Wu;Jingling Xue;Feng Wang;Canqun Yang
Affiliations:
School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia;School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia;School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia;School of Computer Science, National University of Defense Technology, Changsha 410073, China;School of Computer Science, National University of Defense Technology, Changsha 410073, China
Venue:
Parallel Computing
Year:
2012




Abstract

Gauss-Seidel and SOR, which are widely used as smoothers in multigrid methods, are difficult to parallelize, particularly on GPGPUs, because of their DOACROSS data dependences. In this paper, we present a new parallel SOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained unconventionally: we start from a K-layer SOR method and parallelize it by applying a non-dependence-preserving scheme that consists of a new domain decomposition technique followed by a loop tiling technique called alternate tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations and performance tuning in maximizing the performance of PDE-like DOACROSS loops on GPGPUs.
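
For readers unfamiliar with the baseline, the following is a minimal sketch of red-black SOR, the method the abstract compares against. It is not the paper's K-layer or alternate-tiling method, which the abstract does not specify in detail. The function name `red_black_sor`, the choice of the 2D Laplace equation with a 5-point stencil, and the default relaxation factor are all assumptions made for illustration.

```python
def red_black_sor(u, omega=1.5, sweeps=100):
    """In-place red-black SOR for the 2D Laplace equation (illustrative sketch).

    u is a list of lists of floats; its boundary entries are held fixed
    (Dirichlet conditions) and only interior points are updated.
    """
    rows, cols = len(u), len(u[0])
    for _ in range(sweeps):
        # Colour 0 ("red") holds points with (i + j) even, colour 1 ("black") odd.
        for color in (0, 1):
            for i in range(1, rows - 1):
                # First interior column j in {1, 2} with (i + j) % 2 == color.
                j0 = 1 + ((i + 1 + color) % 2)
                for j in range(j0, cols - 1, 2):
                    # Gauss-Seidel value from the four neighbours, which all
                    # belong to the opposite colour.
                    gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1])
                    # Over-relaxed update.
                    u[i][j] += omega * (gs - u[i][j])
    return u
```

Because every point of one colour reads only points of the other colour, all updates within a colour are independent and could each be assigned to a separate GPU thread. The cost is that each half-sweep touches every other grid point and reuses little data between the two passes, which is the data-reuse limitation the abstract alludes to.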