GPGPUs are a powerful and energy-efficient solution for many problems. Reaching higher performance or handling larger problems requires distributing the work across multiple GPUs, which adds to the already high programming complexity. In this article, we focus on abstracting the complexity of multi-GPU programming for stencil computations. We show that the best distribution strategy depends not only on the stencil operator, problem size, and GPU, but also on the PCI Express layout. The layout adds nonuniform characteristics to a seemingly homogeneous setup, causing up to 23% performance loss. We address this issue with an autotuner that optimizes the distribution across multiple GPUs.
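To make the idea of tuning the distribution concrete, the following is a minimal sketch in Python and not the article's autotuner. It assumes a 2D grid split into contiguous row slabs, one per GPU, and it replaces timed runs with a hypothetical cost model. The names `partition_rows`, `estimated_step_time`, and `autotune`, along with the parameters `cell_rate` (cells per second per GPU) and `link_bw` (halo cells per second over each GPU's PCI Express link), are illustrative assumptions.

```python
from itertools import product


def partition_rows(n_rows, weights):
    """Split n_rows into contiguous slabs with sizes proportional to weights."""
    total = sum(weights)
    sizes = [n_rows * w // total for w in weights]
    # Integer division can leave a few rows over; hand them to the first slabs.
    for i in range(n_rows - sum(sizes)):
        sizes[i % len(sizes)] += 1
    return sizes


def estimated_step_time(sizes, n_cols, halo, cell_rate, link_bw):
    """Hypothetical cost model: the slowest GPU sets the time of one step."""
    times = []
    for gpu, rows in enumerate(sizes):
        compute = rows * n_cols / cell_rate[gpu]
        # Interior slabs exchange halos with two neighbours, edge slabs with one.
        neighbours = (gpu > 0) + (gpu < len(sizes) - 1)
        transfer = neighbours * halo * n_cols / link_bw[gpu]
        times.append(compute + transfer)
    return max(times)


def autotune(n_rows, n_cols, halo, cell_rate, link_bw, levels=(1, 2, 3, 4)):
    """Try every weight vector built from `levels` and keep the fastest split."""
    best = None
    for weights in product(levels, repeat=len(cell_rate)):
        sizes = partition_rows(n_rows, weights)
        t = estimated_step_time(sizes, n_cols, halo, cell_rate, link_bw)
        if best is None or t < best[0]:
            best = (t, sizes)
    return best


if __name__ == "__main__":
    # Four identical GPUs (assumed rates), two of them behind a slower link.
    rates = [1e9, 1e9, 1e9, 1e9]
    links = [8e9, 8e9, 4e9, 4e9]
    step_time, slab_sizes = autotune(1024, 1024, 1, rates, links)
    print(slab_sizes, step_time)
```

In this sketch the GPUs are identical and only the link bandwidths differ, which mirrors the nonuniformity the abstract attributes to the PCI Express layout. A real tuner would presumably measure kernel and transfer times on the target machine rather than rely on a model like this one.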