Trends in both consumer and high-performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now the graphics coprocessor (GPU), not just the primary CPU. But GPU programming and memory models differ dramatically from those of conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage patterns, and achieving the best overall performance may require mapping portions of a program across all types of resources in the machine. To address the problem of efficiently programming machines with increasingly heterogeneous computational resources, we propose a programming model in which the best mapping of programs to processors and memories is determined empirically. Programs define choices in how their individual algorithms may work, and the compiler generates further choices in how they can map to CPU and GPU processors and memory systems. These choices are given to an empirical autotuning framework that searches the space of possible implementations at installation time. The rich choice space allows the autotuner to construct poly-algorithms that combine many different algorithmic techniques, using both the CPU and the GPU, to obtain better performance than any one technique alone. Experimental results show that algorithmic changes, and the varied use of both CPUs and GPUs, are necessary to obtain up to a 16.5x speedup over using a single program configuration for all architectures.
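The empirical approach described above can be illustrated with a small, hypothetical sketch (not the paper's actual system or its language): a poly-algorithm sort that switches between two algorithmic techniques at a tunable cutoff, where the cutoff is chosen by timing candidate configurations on sample data, analogous to how an autotuner searches a choice space at installation time. All names here (`merge_sort`, `autotune`, the cutoff values) are illustrative assumptions.

```python
import random
import time

def insertion_sort(xs):
    """One algorithmic choice: fast on small inputs."""
    xs = list(xs)
    for i in range(1, len(xs)):
        key, j = xs[i], i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = key
    return xs

def merge_sort(xs, cutoff):
    """A poly-algorithm: below `cutoff`, fall back to insertion sort."""
    if len(xs) <= cutoff:
        return insertion_sort(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid], cutoff), merge_sort(xs[mid:], cutoff)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def autotune(candidates, data):
    """Empirically time each configuration on sample data; keep the fastest.
    A real autotuner would also search device mappings (CPU vs. GPU)."""
    best, best_time = None, float("inf")
    for name, fn in candidates:
        start = time.perf_counter()
        fn(data)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = name, elapsed
    return best

# Search the (tiny) choice space of cutoff values on representative input.
data = [random.random() for _ in range(5000)]
candidates = [(f"cutoff={c}", lambda xs, c=c: merge_sort(xs, c))
              for c in (1, 16, 64, 256)]
print(autotune(candidates, data))
```

The winning cutoff depends on the machine the search runs on, which is the point: the best configuration is a property of the hardware, discovered by measurement rather than fixed at compile time.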