Graphics Processing Units (GPUs) are today's most powerful coprocessors for accelerating massively data-parallel algorithms. However, programmers must adopt new programming paradigms to take full advantage of their computing capabilities, which requires significant programming and maintenance effort. As a result, there is increasing interest in tools that automatically map sequential code to GPUs. Current automatic tools require deep knowledge of both the GPU architecture and the algorithm being mapped, which makes the mapping process labor-intensive. This paper proposes a technique that improves the code mapping of one of these tools, PPCG, and removes the need for any user interaction. It relies on data reuse estimations to explore the mapping space and compute appropriate values for the number of threads per thread block and the tile sizes. Our results show an average speedup of 3x over the default code generated by PPCG.
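To make the idea of exploring a mapping space with a data reuse estimate more concrete, the following is a minimal sketch for a tiled matrix multiplication C = A * B. It is not the model used in the paper: the hardware limits, the candidate sizes, the reuse formula, the tie-breaking rule, and every function name (`reuse_estimate`, `footprint_bytes`, `explore`) are assumptions chosen for illustration only.

```python
from itertools import product

# Hypothetical hardware limits (assumptions, not taken from the paper).
SHARED_MEM_BYTES = 48 * 1024
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32
ELEM_BYTES = 4  # single-precision float


def reuse_estimate(ti, tj, tk):
    """Arithmetic operations per element staged in shared memory
    for one (ti x tj x tk) tile of C = A * B."""
    flops = 2 * ti * tj * tk    # one multiply and one add per iteration point
    loaded = ti * tk + tk * tj  # tile of A plus tile of B
    return flops / loaded


def footprint_bytes(ti, tj, tk):
    """Shared-memory bytes needed to hold the A and B tiles."""
    return (ti * tk + tk * tj) * ELEM_BYTES


def explore(tile_candidates=(8, 16, 32, 64), block_candidates=(8, 16, 32)):
    """Enumerate (tile sizes, block shape) pairs and keep the best one.

    Returns (score, threads, (ti, tj, tk), (bx, by)). Ties on the score
    are broken by tuple comparison: more threads first, then larger tiles.
    """
    best = None
    for ti, tj, tk in product(tile_candidates, repeat=3):
        if footprint_bytes(ti, tj, tk) > SHARED_MEM_BYTES:
            continue  # tiles would not fit in shared memory
        for bx, by in product(block_candidates, repeat=2):
            threads = bx * by
            if threads > MAX_THREADS_PER_BLOCK or threads % WARP_SIZE:
                continue  # too many threads, or a partially filled warp
            if ti % by or tj % bx:
                continue  # block shape must divide the tile evenly
            candidate = (reuse_estimate(ti, tj, tk), threads,
                         (ti, tj, tk), (bx, by))
            if best is None or candidate > best:
                best = candidate
    return best


if __name__ == "__main__":
    score, threads, tiles, block = explore()
    print(f"tiles={tiles} block={block} threads={threads} reuse={score:.1f}")
```

A tool such as PPCG would then receive the selected tile sizes and block shape as its mapping parameters. An exhaustive search is affordable here only because the candidate lists are tiny; a realistic explorer would also need to account for occupancy, register pressure, and memory coalescing, none of which this sketch models.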