Unstructured grid applications on GPU: performance analysis and improvement
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
The set of scientific applications suitable for GPUs has grown, driven both by the computational power of GPUs and by the availability of programming languages that make writing scientific applications for GPUs more approachable. However, as problem sizes grow, the global memory of a single GPU becomes a limitation. Multi-GPU systems can make such memory-limited problems tractable by dividing the data and computation among several GPUs, but parallel execution is seriously limited by (i) the application's data dependencies and (ii) data transfers among GPUs. In this paper we analyze the parallelization potential of unstructured grid applications based on the data dependencies of the algorithm and the amount of communication required. Because of these dependencies and the required communication, data-parallel and task-parallel techniques exhibit different communication overheads and compute-device utilization. Based on this analysis, we propose a scheme that exploits both data and task parallelism and reduces the communication overhead by overlapping computation with communication. Our OpenCL implementation reduces the communication overhead by 38%, and, for comparison, a two-GPU implementation provides an almost five-fold performance increase over a CPU implementation.
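
To make the overlap concrete, the sketch below shows how such a computation-communication overlap could be expressed in the OpenCL host API. It is a minimal illustration under stated assumptions, not the paper's implementation: two GPUs with a 1-D domain decomposition, an in-order compute queue plus a dedicated transfer queue per device, hypothetical kernels passed in as interior and halo, and a host-staged halo exchange; setup and error checking are elided.

```c
/* Hedged sketch: computation-communication overlap across two GPUs in
 * OpenCL. Kernel names, offsets, and the host-staged exchange are
 * illustrative assumptions, not the paper's code. */
#include <CL/cl.h>

#define NGPU 2

/* One time step per GPU g:
 *   qc[g]    - in-order compute queue     qx[g]    - transfer queue
 *   grid[g]  - device buffer for g's partition
 *   stage[g] - host memory for the halo g sends to its neighbor */
static void step(cl_command_queue qc[NGPU], cl_command_queue qx[NGPU],
                 cl_kernel interior[NGPU], cl_kernel halo[NGPU],
                 cl_mem grid[NGPU], void *stage[NGPU],
                 size_t n_interior, size_t n_halo, size_t halo_bytes,
                 const size_t off_send[NGPU], const size_t off_recv[NGPU])
{
    cl_event sent[NGPU], recvd[NGPU];

    for (int g = 0; g < NGPU; ++g) {
        /* (1) Interior update: touches no halo cells, so it can run
         *     while the halo data is in flight. */
        clEnqueueNDRangeKernel(qc[g], interior[g], 1, NULL,
                               &n_interior, NULL, 0, NULL, NULL);
        /* (2) Concurrently copy the outgoing halo to host memory on the
         *     separate transfer queue (CL_FALSE = non-blocking). */
        clEnqueueReadBuffer(qx[g], grid[g], CL_FALSE, off_send[g],
                            halo_bytes, stage[g], 0, NULL, &sent[g]);
    }

    for (int g = 0; g < NGPU; ++g) {
        int nb = (g + 1) % NGPU;  /* neighboring GPU */
        /* (3) Forward g's halo into the neighbor's buffer; the event
         *     dependency orders it after the read without blocking
         *     the host thread. */
        clEnqueueWriteBuffer(qx[nb], grid[nb], CL_FALSE, off_recv[nb],
                             halo_bytes, stage[g], 1, &sent[g], &recvd[nb]);
    }

    for (int g = 0; g < NGPU; ++g) {
        /* (4) Boundary update waits for the incoming halo; the in-order
         *     compute queue already orders it after the interior kernel. */
        clEnqueueNDRangeKernel(qc[g], halo[g], 1, NULL,
                               &n_halo, NULL, 1, &recvd[g], NULL);
    }

    for (int g = 0; g < NGPU; ++g) {  /* end-of-step synchronization */
        clFinish(qc[g]);
        clReleaseEvent(sent[g]);
        clReleaseEvent(recvd[g]);
    }
}
```

The dedicated transfer queue is what lets step (2) proceed while step (1) executes; on a single in-order queue the read would be serialized behind the kernel. Two practical caveats, both assumptions here: truly asynchronous transfers usually require pinned host memory (e.g., buffers created with CL_MEM_ALLOC_HOST_PTR), and a strictly spec-conformant variant would carve the halo regions out with clCreateSubBuffer so that the concurrently running kernel and the read never alias the same cl_mem object.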