Cache Accurate Time Skewing in Iterative Stencil Computations

Authors:
Robert Strzodka;Mohammed Shaheen;Dawid Pajak;Hans-Peter Seidel
Affiliations:
-;-;-;-
Venue:
ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
Year:
2011

Citing 0
Cited 2

Tiling stencil computations to maximize parallelism

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Split tiling for GPUs: automatic parallelization using trapezoidal tiles

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Quantified Score

Hi-index	0.01

Visualization

Abstract

We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of $500^3$ doubles and 100 iterations on a quad-core Xeon X5482 3.2GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-art automatic locality optimizer PluTo achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13.0 GFLOPS (52\% of the stencil peak benchmark).We present results for 2D and 3D domains in double precision including problems with gigabyte large data sets. The results are compared against hand-optimized naive schemes, PluTo, the stencil peak benchmark and results from literature. For constant stencils of slope one we break the dependence on the low system bandwidth and achieve at least 50\% of the stencil peak, thus performing within a factor two of an ideal system with infinite bandwidth (the benchmark runs on registers without memory access). For large stencils and banded matrices the additional data transfers let the limitations of the system bandwidth come into play again, however, our algorithm still gains a large improvement over the other schemes.