Cache oblivious parallelograms in iterative stencil computations

  • Authors:
  • Robert Strzodka;Mohammed Shaheen;Dawid Pajak;Hans-Peter Seidel

  • Affiliations:
  • Max Planck Institut Informatik, Saarbrücken, Germany;Max Planck Institut Informatik, Saarbrücken, Germany;West Pomeranian University of Technology, Szczecin, Poland;Max Planck Institut Informatik, Saarbrücken, Germany

  • Venue:
  • Proceedings of the 24th ACM International Conference on Supercomputing
  • Year:
  • 2010

Abstract

We present a new cache-oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations, as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double-precision elements, for constant and variable stencils, against hand-optimized naive code and the automatic polyhedral parallelizer and locality optimizer PluTo, and demonstrate the clear superiority of our results. The performance benefits stem from a tiling structure that caters for data locality, parallelism and vectorization simultaneously. Rather than tiling the iteration space from inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles and a locality-preserving parallelization. These advantages come at the cost of an irregular workload distribution, but a tightly integrated load balancer ensures a high utilization of all resources.
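
To make the idea of parallelogram tiles concrete, the following is a minimal sketch of classic fixed-size time skewing for a 1D three-point averaging stencil. It is not the scheme from the paper: there is no tile hierarchy, no parallelization, no vectorization and no load balancing, and the tiles run sequentially from left to right. The function names, the stencil weights and the parameters tile_width and tile_height are assumptions chosen for illustration. Each tile shifts one cell to the left per time step, so every value it reads has already been produced by the same tile or by a tile to its left, and several time steps are applied to a small block of data before moving on.

```python
def naive_stencil(u0, steps):
    """Reference version: sweep the whole 1D domain once per time step."""
    old, new = list(u0), list(u0)
    n = len(u0)
    for _ in range(steps):
        for i in range(1, n - 1):
            new[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
        old, new = new, old
    return old


def parallelogram_stencil(u0, steps, tile_width=8, tile_height=4):
    """Same computation, traversed in parallelogram-shaped space-time tiles.

    tile_width and tile_height are illustrative parameters (assumptions),
    not values taken from the paper.
    """
    n = len(u0)
    # buf[t % 2] holds time level t; the boundary cells 0 and n - 1 stay fixed.
    buf = [list(u0), list(u0)]
    t0 = 0
    while t0 < steps:
        height = min(tile_height, steps - t0)
        # Enough tiles so that the left-shifted top row still reaches cell n - 2.
        num_tiles = (n - 2 + height) // tile_width + 1
        for k in range(num_tiles):          # tiles from left to right
            for s in range(height):         # time steps inside one tile
                t = t0 + s
                src, dst = buf[t % 2], buf[(t + 1) % 2]
                # The tile's spatial range moves one cell left per time step.
                lo = max(1, k * tile_width - s)
                hi = min(n - 1, (k + 1) * tile_width - s)
                for i in range(lo, hi):
                    dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
        t0 += height
    return buf[steps % 2]
```

Because both functions perform the same arithmetic on the same operands for every point, only in a different order, their results should match exactly. For instance, calling both with the same input list and step count and comparing the returned lists is a simple way to check the tiled traversal.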