Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Authors:
Changhao Jiang;Marc Snir
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Year:
2005

Abstract

In order to utilize the tremendous computing power of graphics hardware and to adapt automatically to the fast and frequent changes in its architecture and performance characteristics, this paper implements an automatic tuning system that generates high-performance matrix-multiplication implementations for graphics hardware. The system uses a parameterized code generator to produce multiple versions of matrix multiplication, whose performance is evaluated empirically by actual execution on the target platform. An ad-hoc search engine searches the implementation space for the version that yields the best performance. In contrast to similar systems on CPUs, which use cache blocking, register tiling, and instruction scheduling as tuning strategies, this paper identifies and exploits several tuning strategies that are unique to graphics hardware. These strategies include optimizing for multiple render targets, using SIMD instructions with data packing, and overcoming limitations on instruction count and dynamic branch instructions. The generated implementations perform comparably to an expert's manually tuned version, despite the significant overhead incurred by the use of the high-level BrookGPU language.
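The generate, measure, and search loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's system: it runs on the CPU in plain Python rather than on graphics hardware through BrookGPU, it uses a single hypothetical tuning parameter (`block`, a loop-blocking size) in place of the GPU-specific parameters the abstract lists, and it replaces the ad-hoc search engine with an exhaustive scan over a small candidate set. The names `make_matmul`, `time_variant`, and `autotune` are illustrative assumptions.

```python
import time


def make_matmul(block):
    """Parameterized generator: return one blocked matrix-multiply variant.

    `block` is a hypothetical tuning parameter standing in for the
    GPU-specific parameters tuned in the paper.
    """
    def matmul(a, b, n):
        c = [[0.0] * n for _ in range(n)]
        # Iterate over block-sized tiles of the i, k, and j index ranges.
        for ii in range(0, n, block):
            for kk in range(0, n, block):
                for jj in range(0, n, block):
                    for i in range(ii, min(ii + block, n)):
                        row_a, row_c = a[i], c[i]
                        for k in range(kk, min(kk + block, n)):
                            a_ik, row_b = row_a[k], b[k]
                            for j in range(jj, min(jj + block, n)):
                                row_c[j] += a_ik * row_b[j]
        return c
    return matmul


def time_variant(fn, a, b, n, repeats=3):
    """Empirical evaluation: run the variant and keep the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(a, b, n)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(n=32, candidate_blocks=(1, 2, 4, 8, 16, 32)):
    """Search: time every candidate on this machine and return the fastest.

    Exhaustive search is an assumption made for brevity; the paper
    describes an ad-hoc search engine whose details are not given here.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {
        blk: time_variant(make_matmul(blk), a, b, n)
        for blk in candidate_blocks
    }
    best_block = min(timings, key=timings.get)
    return best_block, timings


if __name__ == "__main__":
    best_block, timings = autotune()
    for blk, seconds in sorted(timings.items()):
        print(f"block={blk:2d}  {seconds * 1e3:8.3f} ms")
    print("selected block size:", best_block)
```

The selected block size depends on the machine that runs the script, which is the point of empirical tuning: the choice is made by measurement on the target platform rather than by a fixed performance model.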