Cache Accurate Time Skewing in Iterative Stencil Computations

  • Authors:
  • Robert Strzodka;Mohammed Shaheen;Dawid Pajak;Hans-Peter Seidel

  • Affiliations:
  • -;-;-;-

  • Venue:
  • ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
  • Year:
  • 2011

Quantified Score

Hi-index 0.01

Visualization

Abstract

We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of $500^3$ doubles and 100 iterations on a quad-core Xeon X5482 3.2GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-art automatic locality optimizer PluTo achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13.0 GFLOPS (52\% of the stencil peak benchmark).We present results for 2D and 3D domains in double precision including problems with gigabyte large data sets. The results are compared against hand-optimized naive schemes, PluTo, the stencil peak benchmark and results from literature. For constant stencils of slope one we break the dependence on the low system bandwidth and achieve at least 50\% of the stencil peak, thus performing within a factor two of an ideal system with infinite bandwidth (the benchmark runs on registers without memory access). For large stencils and banded matrices the additional data transfers let the limitations of the system bandwidth come into play again, however, our algorithm still gains a large improvement over the other schemes.