An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Profile-guided optimization of critical medical imaging algorithms
ISBI'09 Proceedings of the Sixth IEEE international conference on Symposium on Biomedical Imaging: From Nano to Macro
Non-negative matrix factorization implementation using graphic processing units
IDEAL'10 Proceedings of the 11th international conference on Intelligent data engineering and automated learning
Achieving a single compute device image in OpenCL for multiple GPUs
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Scalable framework for mapping streaming applications onto multi-GPU systems
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
International Journal of High Performance Computing Applications
PARTANS: An autotuning framework for stencil computation on multi-GPU systems
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
GPU optimization of convolution for large 3-d real images
ACIVS'12 Proceedings of the 14th international conference on Advanced Concepts for Intelligent Vision Systems
Parallel implementation of a X-ray tomography reconstruction algorithm based on MPI and CUDA
Proceedings of the 20th European MPI Users' Group Meeting
Efficient co-processor utilization in database query processing
Information Systems
Hi-index | 0.00 |
Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution. In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of performance speedup. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set. This is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware, or write the code for a multiple-GPU implementation. When used to predict performance on six scientific applications, our framework produces accurate performance estimates (11% difference on average and 40% maximum difference in a single case) for a range of short and long running scientific programs.