Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Methods of inference and learning for performance modeling of parallel applications
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A regression-based approach to scalability prediction
Proceedings of the 22nd annual international conference on Supercomputing
A cross-input adaptive framework for GPU program optimizations
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
MiDataSets: creating the conditions for a more realistic evaluation of Iterative optimization
HiPEAC'07 Proceedings of the 2nd international conference on High performance embedded architectures and compilers
Evaluating iterative optimization across 1000 datasets
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Adaptive input-aware compilation for graphics engines
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Input-aware auto-tuning for directive-based GPU programming
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
A large-scale cross-architecture evaluation of thread-coarsening
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Graphics Processing Units (GPUs) are efficient devices capable of delivering high performance for general purpose computation. Realizing their full performance potential often requires extensive compiler tuning. This process is particularly expensive since it has to be repeated for each target program and platform. In this paper we study the utilization of GPU hardware resources across multiple input sizes and compiler options. In this context we introduce the notion of hardware saturation. Saturation is reached when an application is executed with a number of threads large enough to fully utilize the available hardware resources. We give experimental evidence of hardware saturation and describe its properties using 16 OpenCL kernels on 3 GPUs from Nvidia and AMD. We show that input sizes that saturates the GPU show performance stability across compiler transformations. Using the thread-coarsening transformation as an example, we show that compiler settings maintain their relative performance across input sizes within the saturation region. Leveraging these hardware and software properties we propose a technique to identify the input size at the lower bound of the saturation zone, we call it Minimum Saturation Point (MSP). By performing iterative compilation on the MSP input size we obtain results effectively applicable for much large input problems reducing the overhead of tuning by an order of magnitude on average.