The portability of parallel programs across MIMD computers
Automatically Tuned Linear Algebra Software
The design and development of ZPL
Proceedings of the third ACM SIGPLAN conference on History of programming languages
Program optimization space pruning for a multithreaded GPU
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Scalable Parallel Programming with CUDA
ACM Queue: GPU Computing
Accelerating advanced MRI reconstructions on GPUs
Journal of Parallel and Distributed Computing
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
Languages and Compilers for Parallel Computing
Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Proceedings of the 36th annual international symposium on Computer architecture
Parallel codes are written primarily for performance, so it is highly desirable that they be portable across parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single-processor systems, this goal has been studied less extensively, and is harder to achieve, for parallel systems. Emerging single-chip parallel platforms are no exception: writing code that performs well across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture we are developing, called Rigel. We demonstrate that these optimizations improve performance over naïve translations, and that the resulting performance is comparable to that of code hand-optimized for Rigel.