Auto-tuning stencil codes for cache-based multicore platforms

Authors:
Katherine A. Yelick;Kaushik Datta
Affiliations:
University of California, Berkeley;University of California, Berkeley
Venue:
Auto-tuning stencil codes for cache-based multicore platforms
Year:
2009

Citing 0
Cited 8

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
The pochoir stencil compiler

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
ALTER: exploiting breakable dependences for parallelization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
High-performance code generation for stencil computations on GPU architectures

Proceedings of the 26th ACM international conference on Supercomputing
Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Proceedings of the 39th Annual International Symposium on Computer Architecture
Patus for convenient high-performance stencils: evaluation in earthquake simulations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Tiling stencil computations to maximize parallelism

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

International Journal of High Performance Computing Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

As clock frequencies have tapered off and the number of cores on a chip has taken off, the challenge of effectively utilizing these multicore systems has become increasingly important. However, the diversity of multicore machines in today's market compels us to individually tune for each platform. This is especially true for problems with low computational intensity, since the improvements in memory latency and bandwidth are much slower than those of computational rates. One such kernel is a stencil, a regular nearest neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel. We examine the performance of these stencil codes over a spectrum of multicore architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the highly-multithreaded Sun Victoria Falls, and the low power IBM Blue Gene/P. These platforms not only have significant variations in their core architectures, but also exhibit a 32× range in available hardware threads, a 4.5× range in attained DRAM bandwidth, and a 6.3× range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms represents a serious challenge. Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and take advantage of special instructions. The auto-tuner then searches over the space of optimizations, thereby allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4× faster than a straightforward implementation and is more scalable across cores. By using performance models to identify performance limits, we determined that our auto-tuner can achieve over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.