Program optimization carving for GPU computing

Authors:
Shane Ryoo; Christopher I. Rodrigues; Sam S. Stone; John A. Stratton; Sain-Zee Ueng; Sara S. Baghsorkhi; Wen-mei W. Hwu
Affiliations:
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 1308 W Main Street, Urbana, IL 61801, United States (all authors)
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization carving, a technique that begins with a complete optimization space and prunes it down to a set of configurations that is likely to contain the global maximum. The remaining configurations can then be evaluated to determine the one with the best performance. The technique can reduce the number of configurations to be evaluated by as much as 98% and is successful at finding a near-best configuration. For some applications, we show that this approach is significantly superior to random sampling of the search space.
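
As a rough illustration of the carve-then-evaluate flow described above, the following Python sketch enumerates a small configuration space, prunes it with cheap static scores, and measures only the survivors. The option names, the two scores in `toy_metrics`, the Pareto-style pruning rule, and `toy_measure` are all assumptions made for this example; the abstract does not specify which metrics or pruning criterion the technique uses, and on real hardware `toy_measure` would be replaced by a timed kernel run.

```python
from itertools import product

# Hypothetical search space: option names and values are illustrative only.
OPTIONS = {"tile": [4, 8, 16], "unroll": [1, 2, 4], "prefetch": [False, True]}


def enumerate_space(options):
    """Return every configuration as a dict: the complete optimization space."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]


def toy_metrics(cfg):
    """Made-up static scores (higher is better); not taken from the paper."""
    # Assumed proxy: larger tiles and more unrolling mean less overhead.
    efficiency = cfg["tile"] * cfg["unroll"]
    # Assumed proxy: more registers per thread leave room for fewer threads.
    regs = 10 + 2 * cfg["unroll"] + (4 if cfg["prefetch"] else 0)
    utilization = 256 // regs
    return (efficiency, utilization)


def carve(configs, metrics):
    """Drop every configuration that another one matches or beats on all scores.

    This Pareto filter is an assumed pruning rule chosen for the sketch.
    Configurations with identical scores do not eliminate each other.
    """
    scored = [(cfg, metrics(cfg)) for cfg in configs]
    kept = []
    for cfg, score in scored:
        dominated = any(
            other != score and all(o >= s for o, s in zip(other, score))
            for _, other in scored
        )
        if not dominated:
            kept.append(cfg)
    return kept


def toy_measure(cfg):
    """Stand-in for a timed run (lower is better); purely synthetic."""
    efficiency, utilization = toy_metrics(cfg)
    return 1000.0 / (efficiency * utilization)


def best_of(configs, measure):
    """Evaluate each remaining configuration and return the fastest one."""
    return min(configs, key=measure)


if __name__ == "__main__":
    space = enumerate_space(OPTIONS)
    kept = carve(space, toy_metrics)
    print(f"evaluating {len(kept)} of {len(space)} configurations")
    print("best:", best_of(kept, toy_measure))
```

In this toy space the filter keeps 3 of the 18 configurations, and the best of those three matches the best of the full space under the synthetic cost. That agreement is a property of the made-up cost function, which improves with both scores; with real measurements the pruned set is only likely, not guaranteed, to contain the global maximum.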