Understanding stencil code performance on multicore architectures

Authors:
Shah M. Faizur Rahman;Qing Yi;Apan Qasem
Affiliations:
University of Texas at San Antonio;University of Texas at San Antonio;Texas State University San Marcos, TX
Venue:
Proceedings of the 8th ACM International Conference on Computing Frontiers
Year:
2011

Citing 25
Cited 1

New tiling techniques to improve cache temporal locality

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Tiling optimizations for 3D scientific computations

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Data locality enhancement by memory reduction

ICS '01 Proceedings of the 15th international conference on Supercomputing
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy

Proceedings of the international symposium on Code generation and optimization
Using Hardware Counters to Automatically Improve Memory Performance

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Implicit and explicit optimizations for stencil computations

Proceedings of the 2006 workshop on Memory system performance and correctness
Effective automatic parallelization of stencil computations

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Rapidly Selecting Good Compiler Optimizations using Performance Counters

Proceedings of the International Symposium on Code Generation and Optimization
Feedback-directed thread scheduling with memory considerations

Proceedings of the 16th international symposium on High performance distributed computing
Detecting Change in Program Behavior for Adaptive Optimization

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
What can performance counters do for memory subsystem analysis?

Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Effective performance measurement and analysis of multithreaded applications

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Pinpointing and Exploiting Opportunities for Enhancing Data Reuse

ISPASS '08 Proceedings of the ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software
Real time power estimation and thread scheduling via performance counters

ACM SIGARCH Computer Architecture News
Parallel data-locality aware stencil computations on modern micro-architectures

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

SIAM Review
Improving parallelism and locality with asynchronous algorithms

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org

Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing
Automated empirical tuning of scientific codes for performance and power consumption

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers

Studying the impact of application-level optimizations on the power consumption of multi-core architectures

Proceedings of the 9th conference on Computing Frontiers

Quantified Score

Hi-index	0.00

Visualization

Abstract

Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimization mechanisms, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can be used to significantly improve the performance of stencil kernels on multi-core architectures. However, the overall performance impact of these optimizations are difficult to predict due to the interplay of load imbalance, synchronization overhead, and cache locality. This paper presents a detailed performance study of these optimizations by applying them with a wide variety of different configurations, using hardware counters to monitor the efficiency of architectural components, and then developing a set of formulas via regression analysis to model their overall performance impact in terms of the affected hardware counter numbers. We have applied our methodology to three stencil computation kernels, a 7-point jacobi, a 27-point jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.