Automatic Library Generation for BLAS3 on GPUs

Authors:
Huimin Cui;Lei Wang;Jingling Xue;Yang Yang;Xiaobing Feng
Affiliations:
-;-;-;-;-
Venue:
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Year:
2011

Citing 0
Cited 6

Model-driven tile size selection for DOACROSS loops on GPUs

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Parameterized micro-benchmarking: an auto-tuning approach for complex applications

Proceedings of the 9th conference on Computing Frontiers
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Layout-oblivious compiler optimization for matrix computations

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
An empirical model for predicting cross-core performance interference on multicore processors

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

High-performance libraries, the performance-critical building blocks for high-level applications, will assume greater importance on modern processors as they become more complex and diverse. However, automatic library generators are still immature, forcing library developers to manually tune library to meet their performance objectives. We are developing a new script-controlled compilation framework to help domain experts reduce much of the tedious and error-prone nature of manual tuning, by enabling them to leverage their expertise and reuse past optimization experiences. We focus on demonstrating improved performance and productivity obtained through using our framework to tune BLAS3 routines on three GPU platforms: up to 5.4x speedups over the CUBLAS achieved on NVIDIA GeForce 9800, 2.8x on GTX285, and 3.4x on Fermi Tesla C2050. Our results highlight the potential benefits of exploiting domain expertise and the relations between different routines (in terms of their algorithms and data structures).