Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve only a low fraction of peak performance, primarily because of the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multicore design of the Cell processor, a machine with an explicitly managed memory hierarchy. Overall, our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak, while the best competing cache-based processor achieves only 54% of algorithmic peak performance.
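For readers unfamiliar with the terms, the following is a minimal sketch of what a single stencil sweep and a cache-blocked variant of it can look like. It is not the paper's implementation. The 7-point stencil with equal weights, the function names (`make_grid`, `sweep_naive`, `sweep_blocked`), and the block sizes are all assumptions chosen for illustration. Pure Python lists do not expose the memory layout that makes blocking pay off, so the sketch shows only the loop structure and that the blocked sweep computes the same result.

```python
def make_grid(n, fill=None):
    """Build an n x n x n grid; deterministic sample values unless fill is given."""
    return [[[float((i * 31 + j * 17 + k * 7) % 11) if fill is None else fill
              for k in range(n)] for j in range(n)] for i in range(n)]


def update_point(src, dst, i, j, k):
    """Assumed 7-point stencil: average of a point and its six face neighbors."""
    dst[i][j][k] = (src[i][j][k]
                    + src[i - 1][j][k] + src[i + 1][j][k]
                    + src[i][j - 1][k] + src[i][j + 1][k]
                    + src[i][j][k - 1] + src[i][j][k + 1]) / 7.0


def sweep_naive(src, dst, n):
    """One out-of-place sweep over all interior points, in plain loop order."""
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                update_point(src, dst, i, j, k)


def sweep_blocked(src, dst, n, bj, bk):
    """Same sweep, with the j and k loops tiled into bj x bk blocks.

    Each block is streamed through the full i range, so the planes needed
    for neighboring i values cover a smaller working set (hypothetical
    block sizes; real ones would be tuned to the cache).
    """
    for jj in range(1, n - 1, bj):
        j_end = min(jj + bj, n - 1)
        for kk in range(1, n - 1, bk):
            k_end = min(kk + bk, n - 1)
            for i in range(1, n - 1):
                for j in range(jj, j_end):
                    for k in range(kk, k_end):
                        update_point(src, dst, i, j, k)
```

A call such as `sweep_blocked(src, dst, n, 3, 4)` visits exactly the same interior points as `sweep_naive(src, dst, n)`, only in a different order. Because the sweep reads from `src` and writes to `dst`, the reordering does not change the result.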