Improving locality and parallelism in nested loops
Improving locality and parallelism in nested loops
New tiling techniques to improve cache temporal locality
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Using Time Skewing to Eliminate Idle Time due to Memory Bandwidth and Network Limitations
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Cache-Efficient Multigrid Algorithms
International Journal of High Performance Computing Applications
Impact of modern memory subsystems on cache optimizations for stencil computations
Proceedings of the 2005 workshop on Memory system performance
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Effective automatic parallelization of stencil computations
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Parameterized tiled loops for free
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Scientific computing Kernels on the cell processor
International Journal of Parallel Programming
Efficient computation of sum-products on GPUs through software-managed cache
Proceedings of the 22nd annual international conference on Supercomputing
A practical automatic polyhedral parallelizer and locality optimizer
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Smashing: Folding Space to Tile through Time
Languages and Compilers for Parallel Computing
3D finite difference computation on GPUs using CUDA
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Implementation of a wide-angle lens distortion correction algorithm on the cell broadband engine
Proceedings of the 23rd international conference on Supercomputing
Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms
Journal of Parallel and Distributed Computing
A Multilevel Parallelization Framework for High-Order Stencil Computations
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Optimized Stencil Computation Using In-Place Calculation on Modern Multicore Systems
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
State-of-the-art in heterogeneous computing
Scientific Programming
Cache oblivious parallelograms in iterative stencil computations
Proceedings of the 24th ACM International Conference on Supercomputing
Introducing the semi-stencil algorithm
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Landing stencil code on Godson-T
Journal of Computer Science and Technology
Data layout transformation for stencil computations on short-vector SIMD architectures
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Understanding stencil code performance on multicore architectures
Proceedings of the 8th ACM International Conference on Computing Frontiers
HiFlow3: a flexible and hardware-aware parallel finite element package
Proceedings of the 9th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing
Out-of-Core Computations of High-Resolution Level Sets by Means of Code Transformation
Journal of Scientific Computing
ACM Transactions on Programming Languages and Systems (TOPLAS)
Fast seismic modeling and reverse time migration on a graphics processing unit cluster
Concurrency and Computation: Practice & Experience
High throughput software for direct numerical simulations of compressible two-phase flows
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Tiling stencil computations to maximize parallelism
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A data-driven approach for executing the CG method on reconfigurable high-performance systems
ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Performance-reliability tradeoff analysis for multithreaded applications
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
International Journal of High Performance Computing Applications
Hi-index | 0.00 |
Stencil-based kernels constitute the core of many scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. We examine several optimizations on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5, as well as the heterogeneous multicore design of the Cell processor. The optimizations target cache reuse across stencil sweeps, including both an implicit cache oblivious approach and a cache-aware algorithm blocked to match the cache structure. Finally, we consider stencil computations on a machine with an explicitly-managed memory hierarchy, the Cell processor. Overall, results show that a cache-aware approach is significantly faster than a cache oblivious approach and that the explicitly managed memory on Cell is more efficient: Relative to the Power5, it has almost 2x more memory bandwidth and is 3.7x faster.