Proceedings of the 9th conference on Computing Frontiers
Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimizations, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can significantly improve the performance of stencil kernels on multi-core architectures. However, the overall performance impact of these optimizations is difficult to predict because load imbalance, synchronization overhead, and cache locality interact. This paper presents a detailed performance study of these optimizations: we apply them under a wide variety of configurations, use hardware counters to monitor the efficiency of architectural components, and then use regression analysis to derive a set of formulas that model the overall performance impact of the optimizations in terms of the affected hardware counter values. We have applied our methodology to three stencil computation kernels: a 7-point Jacobi, a 27-point Jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.
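For readers unfamiliar with the kernels named above, the following is a minimal sketch of a 7-point Jacobi sweep, written once as a plain triple loop and once with rectangular blocking over the two inner dimensions. It is an illustration only and not the implementation studied in the paper: the function names, the flat-list grid layout, the uniform 1/7 weights, and the `tile` parameter are all assumptions made for this example. Because a Jacobi sweep reads only from the old grid and writes only to the new one, blocking changes the traversal order but not the result.

```python
def idx(i, j, k, n):
    """Flat index of point (i, j, k) in an n*n*n grid (assumed row-major layout)."""
    return (i * n + j) * n + k


def update(a, i, j, k, n):
    """Average of a point and its six face neighbours (uniform weights are an assumption)."""
    return (a[idx(i, j, k, n)]
            + a[idx(i - 1, j, k, n)] + a[idx(i + 1, j, k, n)]
            + a[idx(i, j - 1, k, n)] + a[idx(i, j + 1, k, n)]
            + a[idx(i, j, k - 1, n)] + a[idx(i, j, k + 1, n)]) / 7.0


def jacobi7_plain(a, n):
    """One unblocked sweep over the interior; boundary values are copied unchanged."""
    b = list(a)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                b[idx(i, j, k, n)] = update(a, i, j, k, n)
    return b


def jacobi7_blocked(a, n, tile=4):
    """Same sweep with rectangular blocking of the j and k loops (hypothetical tile size)."""
    b = list(a)
    for jj in range(1, n - 1, tile):          # tile origin in j
        for kk in range(1, n - 1, tile):      # tile origin in k
            for i in range(1, n - 1):         # stream through i within each tile
                for j in range(jj, min(jj + tile, n - 1)):
                    for k in range(kk, min(kk + tile, n - 1)):
                        b[idx(i, j, k, n)] = update(a, i, j, k, n)
    return b


if __name__ == "__main__":
    n = 6
    grid = [float((p * 7 + 3) % 11) for p in range(n ** 3)]
    # Blocking reorders the traversal only, so both sweeps must agree exactly.
    print(jacobi7_plain(grid, n) == jacobi7_blocked(grid, n, tile=2))
```

In a compiled implementation the tile size would be chosen so that the working set of a tile fits in cache. This interpreted sketch shows only the loop structure and is not meant to demonstrate any speedup.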