We are witnessing the consolidation of heterogeneous computing within parallel computing, with architectures such as the Cell Broadband Engine (Cell BE) and Graphics Processing Units (GPUs) present in a myriad of developments for high performance computing. These platforms provide a Software Development Kit (SDK) that maximizes performance at the expense of exposing complex, low-level architectural details, which makes software development a daunting task. This paper explores stencil computations in several heterogeneous programming models, such as Cell SDK, CellSs, ALF and CUDA, to optimize the Jacobi method for solving Laplace's differential equation. We describe the programming techniques used to extract maximum performance on the Cell BE and the GPU, and compare their computing paradigms. Experimental results are shown on two Nvidia Teslas and one IBM BladeCenter QS20 blade, which incorporates two 3.2 GHz Cell BEs v 5.1. The speed-up factor for our set of GPU optimizations reaches 3-4×, and the GPU execution times beat those of the Cell BE by an order of magnitude, while also scaling well when moving towards newer GPU generations and/or more demanding problem sizes.
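For readers unfamiliar with the kernel being optimized, the following is a minimal sequential sketch of a Jacobi iteration for Laplace's equation on a 2D grid with fixed boundary values. It is not the paper's Cell BE or CUDA implementation. The function name `jacobi_laplace`, its parameters, and the convergence test are assumptions made for illustration. Each interior point is replaced by the average of its four neighbours from the previous iterate, so the point updates within one sweep are independent of one another, which is the property that parallel stencil implementations rely on.

```python
def jacobi_laplace(grid, tol=1e-6, max_iters=10_000):
    """Hypothetical reference sketch: Jacobi sweeps with a 5-point stencil.

    grid      -- list of lists of floats; the outer ring holds fixed boundary values
    tol       -- stop once the largest point update in a sweep falls below this value
    max_iters -- upper bound on the number of sweeps
    Returns (solution, sweeps_performed).
    """
    rows, cols = len(grid), len(grid[0])
    current = [row[:] for row in grid]
    for sweep in range(1, max_iters + 1):
        # Write into a second buffer: Jacobi reads only the previous iterate.
        nxt = [row[:] for row in current]
        max_change = 0.0
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                nxt[i][j] = 0.25 * (current[i - 1][j] + current[i + 1][j]
                                    + current[i][j - 1] + current[i][j + 1])
                max_change = max(max_change, abs(nxt[i][j] - current[i][j]))
        current = nxt
        if max_change < tol:
            return current, sweep
    return current, max_iters


# Usage: an 8x8 plate with the top edge held at 1.0 and all other edges at 0.0.
plate = [[0.0] * 8 for _ in range(8)]
plate[0] = [1.0] * 8
solution, sweeps = jacobi_laplace(plate)
```

The two-buffer layout (`current` and `nxt`) is what distinguishes Jacobi from Gauss-Seidel, which updates in place. It costs extra memory but removes dependencies between the point updates within a sweep.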