While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high-performance computations, the tedious performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program's input characteristics can make an already-optimized implementation perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system that tackles this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimization and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and to customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic generates code that performs on par with its hand-optimized counterparts over certain input ranges and outperforms them when the input falls outside the hand-optimized programs' "comfort zone". Furthermore, we show that the input-aware results are sustained across different GPU targets, making it possible to write and optimize applications once and run them anywhere.