Parallel data-locality aware stencil computations on modern micro-architectures

Authors:
Matthias Christen;Olaf Schenk;Esra Neufeld;Peter Messmer;Helmar Burkhart
Affiliations:
High Performance and Web Computing Group, Computer Science Dept., University of Basel, Switzerland;High Performance and Web Computing Group, Computer Science Dept., University of Basel, Switzerland;IT'IS Foundation, ETH Zurich, Switzerland;Tech-X Corporation, Boulder CO, USA;High Performance and Web Computing Group, Computer Science Dept., University of Basel, Switzerland
Venue:
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Year:
2009

Citing 0
Cited 6

Understanding stencil code performance on multicore architectures
Proceedings of the 8th ACM International Conference on Computing Frontiers

CUDA 2D stencil computations for the Jacobi method
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I

Fast wavelet transform utilizing a multicore-aware framework
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2

Stencil computations on heterogeneous platforms for the Jacobi method: GPUs versus Cell BE
The Journal of Supercomputing

Patus for convenient high-performance stencils: evaluation in earthquake simulations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor
International Journal of High Performance Computing Applications

Abstract

Novel micro-architectures, including the Cell Broadband Engine Architecture (Cell BE) and graphics processing units (GPUs), are attractive platforms for compute-intensive simulations. This paper focuses on stencil computations arising in a biomedical simulation, presents performance benchmarks on both the Cell BE and GPUs, and contrasts them with a benchmark on a traditional CPU system. Because stencil computations have low arithmetic intensity, they typically reach only a fraction of the peak performance of the compute hardware. An algorithm is presented that reduces the bandwidth requirements, and thereby improves performance, by exploiting temporal locality of the data. We report on performance improvements over CPU implementations.
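
The abstract does not specify the algorithm, so the following is only a minimal sketch of one common way to exploit temporal locality in a stencil code: overlapped temporal tiling, in which each spatial tile is loaded once together with a halo and then advanced several time steps before its result is written back. The 1D three-point averaging stencil, the fixed boundary values, and the names `jacobi_step`, `naive`, `time_blocked`, `tile`, and `tb` are all assumptions made for illustration and are not taken from the paper. Pure Python shows no bandwidth benefit; the sketch only illustrates the access pattern and checks that it reproduces the plain sweep.

```python
def jacobi_step(u):
    """One sweep of a 3-point averaging stencil; the two end cells stay fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def naive(u, steps):
    """Reference version: every time step streams over the whole array."""
    u = list(u)
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def time_blocked(u, steps, tile=8, tb=4):
    """Overlapped temporal tiling (illustrative, hypothetical parameters).

    Each tile of `tile` interior cells is copied with a halo of k cells per
    side and advanced k <= tb steps locally. Halo cells are recomputed
    redundantly; the stale region grows inward by one cell per step, so after
    k steps exactly the tile itself is still correct.
    """
    u = list(u)
    n = len(u)
    done = 0
    while done < steps:
        k = min(tb, steps - done)
        out = u[:]
        for lo in range(1, n - 1, tile):
            hi = min(lo + tile, n - 1)
            start = max(lo - k, 0)   # clamp the halo at the fixed boundaries
            end = min(hi + k, n)
            local = u[start:end]     # one load of tile plus halo
            for _ in range(k):       # k time steps on the local copy
                local = jacobi_step(local)
            out[lo:hi] = local[lo - start:hi - start]
        u = out
        done += k
    return u


if __name__ == "__main__":
    grid = [float(i % 7) for i in range(50)]
    a = naive(grid, 10)
    b = time_blocked(grid, 10)
    print(max(abs(x - y) for x, y in zip(a, b)))
```

In this sketch each tile is read from the main array once per `tb` time steps instead of once per step, at the cost of recomputing the halo cells. On real hardware the local copy would live in a fast memory such as a Cell BE local store, GPU shared memory, or a CPU cache; that mapping is an assumption here.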