SIGGRAPH '93 Proceedings of the 20th annual conference on Computer graphics and interactive techniques
InfiniteReality: a real-time graphics system
Proceedings of the 24th annual conference on Computer graphics and interactive techniques
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
ICS '97 Proceedings of the 11th international conference on Supercomputing
Ray tracing on programmable graphics hardware
Proceedings of the 29th annual conference on Computer graphics and interactive techniques
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Fast matrix multiplies using graphics hardware
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
A comparison of empirical and model-driven optimization
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
ACM SIGGRAPH 2003 Papers
Fast computation of database operations using graphics processors
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Optimizing Sorting with Genetic Algorithms
Proceedings of the international symposium on Code generation and optimization
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Spiral: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
International Journal of High Performance Computing Applications
Sparsity: Optimization Framework for Sparse Matrix Kernels
International Journal of High Performance Computing Applications
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Program optimization carving for GPU computing
Journal of Parallel and Distributed Computing
Mars: a MapReduce framework on graphics processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Parallel processing of matrix multiplication in a CPU and GPU heterogeneous environment
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
A code motion technique for accelerating general-purpose computation on the GPU
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Parallel genetic algorithm on the CUDA architecture
EvoApplicatons'10 Proceedings of the 2010 international conference on Applications of Evolutionary Computation - Volume Part I
Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
Hi-index | 0.00 |
In order to utilize the tremendous computing power of grpahics hardware and to automatically adapt to the fast and frequent changes in its architecture and performance characteristics, this paper implements an automatic tuning system to generate high-performance matrix-multiplication implementation on graphics hardware. The automatic tuning system uses a parameterized code generator to generate multiple versions of matrix multiplication, whose performances are empirically evaluated by actual execution on the target platform. An ad-hoc search engine is employed to search over the implementation space for the version that yields the best performance. In contrast to similar systems on CPUs, which utilize cache blocking, register tiling, instruction scheduling tuning strategies, this paper identifies and exploits several tuning strategies that are unique for graphics hardware. These tuning strategies include optimizing for multiple-render-targets, SIMD instructions with data packing, overcoming limitations on instruction count and dynamic branch instruction. The generated implementations have comparable performance with expert manually tuned version in spite of the significant overhead incurred due to the use of the high-level BrookGPU language.