A stencil computation sweeps over a spatial grid for multiple time steps, updating each point from its nearest neighbors. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, so their performance is bound by the available memory bandwidth. Because memory bandwidth grows more slowly than compute capability, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D spatial blocking and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resulting algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with SIMD width and the number of cores. Our performance is comparable to or better than state-of-the-art stencil implementations on CPUs and GPUs. For single-precision floating-point inputs, our implementation of the 7-point stencil is 1.5X faster on CPUs and 1.8X faster on GPUs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
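To make the terminology concrete, the following is a minimal sketch of a 7-point stencil sweep and of the spatial half of the blocking idea: the grid is tiled in X and Y while each tile is streamed through along Z. It is not the implementation described in the abstract. It omits temporal blocking, SIMD, and threading entirely, and the coefficients `alpha` and `beta`, the fixed-boundary treatment, the grid dimensions, the tile sizes, and all function names are assumptions chosen for illustration.

```python
# Illustrative sketch only: coefficients, grid size, tile sizes, and the
# fixed-boundary treatment are assumptions, not taken from the paper.

def make_grid(nx, ny, nz):
    """Build a deterministic nz x ny x nx grid as nested lists."""
    return [[[float((x * 7 + y * 3 + z * 5) % 11) for x in range(nx)]
             for y in range(ny)] for z in range(nz)]

def stencil_point(src, x, y, z, alpha, beta):
    """7-point stencil: the centre point plus its six face neighbours."""
    return (alpha * src[z][y][x]
            + beta * (src[z][y][x - 1] + src[z][y][x + 1]
                      + src[z][y - 1][x] + src[z][y + 1][x]
                      + src[z - 1][y][x] + src[z + 1][y][x]))

def sweep_naive(src, alpha, beta):
    """One time step over all interior points, with no blocking."""
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    dst = [[row[:] for row in plane] for plane in src]  # boundary cells keep their values
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                dst[z][y][x] = stencil_point(src, x, y, z, alpha, beta)
    return dst

def sweep_blocked_25d(src, alpha, beta, tile_x, tile_y):
    """One time step with X-Y tiling and streaming along Z.

    For each X-Y tile, only the planes z-1, z, and z+1 of that tile are
    touched at any moment, which is what lets a real implementation keep
    the working set in on-chip memory.
    """
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    dst = [[row[:] for row in plane] for plane in src]
    for y0 in range(1, ny - 1, tile_y):
        y1 = min(y0 + tile_y, ny - 1)
        for x0 in range(1, nx - 1, tile_x):
            x1 = min(x0 + tile_x, nx - 1)
            for z in range(1, nz - 1):  # stream through the unblocked dimension
                for y in range(y0, y1):
                    for x in range(x0, x1):
                        dst[z][y][x] = stencil_point(src, x, y, z, alpha, beta)
    return dst

grid = make_grid(10, 9, 8)
reference = sweep_naive(grid, 0.4, 0.1)
blocked = sweep_blocked_25d(grid, 0.4, 0.1, tile_x=4, tile_y=3)
results_match = reference == blocked
```

Blocking changes only the order in which points are visited, not the arithmetic at each point, so the two sweeps should agree exactly. In a pure Python sketch like this the reordering brings no speed benefit; it only shows the loop structure that a cache-aware or shared-memory implementation would build on.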