Understanding stencil code performance on multicore architectures

  • Authors:
  • Shah M. Faizur Rahman;Qing Yi;Apan Qasem

  • Affiliations:
  • University of Texas at San Antonio;University of Texas at San Antonio;Texas State University San Marcos, TX

  • Venue:
  • Proceedings of the 8th ACM International Conference on Computing Frontiers
  • Year:
  • 2011

Abstract

Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimization mechanisms, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can significantly improve the performance of stencil kernels on multi-core architectures. However, the overall performance impact of these optimizations is difficult to predict due to the interplay of load imbalance, synchronization overhead, and cache locality. This paper presents a detailed performance study of these optimizations by applying them in a wide variety of configurations, using hardware counters to monitor the efficiency of architectural components, and then developing a set of formulas via regression analysis to model their overall performance impact in terms of the affected hardware counter values. We have applied our methodology to three stencil computation kernels: a 7-point Jacobi, a 27-point Jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.
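To make the kernels and the blocking optimization named in the abstract concrete, the following is a minimal sketch of a 7-point Jacobi sweep traversed in rectangular blocks. It is an illustration only, not the authors' implementation: the function names (`make_grid`, `jacobi7_blocked`, `jacobi7`), the uniform 1/7 averaging weights, the cubic grid, and the single `block` parameter are all assumptions. Time skewing and wavefront or pipeline parallelization are not shown.

```python
def make_grid(n, fill=0.0):
    """Build an n x n x n grid as nested lists (hypothetical helper)."""
    return [[[fill for _ in range(n)] for _ in range(n)] for _ in range(n)]


def jacobi7_blocked(src, dst, n, block):
    """One Jacobi sweep over the interior points, visited tile by tile.

    Each tile is at most block x block x block points. Blocking changes
    only the traversal order, not the values computed, because a Jacobi
    sweep reads exclusively from src and writes exclusively to dst.
    """
    for ii in range(1, n - 1, block):
        for jj in range(1, n - 1, block):
            for kk in range(1, n - 1, block):
                for i in range(ii, min(ii + block, n - 1)):
                    for j in range(jj, min(jj + block, n - 1)):
                        for k in range(kk, min(kk + block, n - 1)):
                            # Assumed weights: plain average of the point
                            # and its six face neighbors.
                            dst[i][j][k] = (
                                src[i][j][k]
                                + src[i - 1][j][k] + src[i + 1][j][k]
                                + src[i][j - 1][k] + src[i][j + 1][k]
                                + src[i][j][k - 1] + src[i][j][k + 1]
                            ) / 7.0


def jacobi7(src, n, steps, block):
    """Run `steps` sweeps, swapping the two buffers after each one.

    Note that src is reused as a buffer and is overwritten when steps > 1.
    """
    # The copy carries the boundary values, which are never rewritten.
    dst = [[row[:] for row in plane] for plane in src]
    for _ in range(steps):
        jacobi7_blocked(src, dst, n, block)
        src, dst = dst, src
    return src
```

Calling `jacobi7(grid, n, steps, block)` with different `block` values yields the same grid; only the memory access order differs, which is the kind of configuration parameter whose cache effects the study measures with hardware counters. A Gauss-Seidel variant would instead update a single buffer in place, which introduces the dependences that make its parallelization harder.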