A memory model for scientific algorithms on graphics processors

Authors:
Naga K. Govindaraju;Scott Larsen;Jim Gray;Dinesh Manocha
Affiliations:
UNC Chapel Hill and Microsoft Corporation;UNC Chapel Hill;Microsoft Corporation;UNC Chapel Hill
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Citing 30
Cited 43

The input/output complexity of sorting and related problems

Communications of the ACM
Evaluating Associativity in CPU Caches

IEEE Transactions on Computers
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
LAPACK's user's guide

LAPACK's user's guide
Compiler blockability of numerical algorithms

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Access normalization: loop restructuring for NUMA computers

ACM Transactions on Computer Systems (TOCS)
Compiler transformations for high-performance computing

ACM Computing Surveys (CSUR)
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Data-centric multi-level blocking

Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
The design and analysis of a cache architecture for texture mapping

Proceedings of the 24th annual international symposium on Computer architecture
External memory algorithms and data structures: dealing with massive data

ACM Computing Surveys (CSUR)
High Performance Compilers for Parallel Computing

High Performance Compilers for Parallel Computing
Fast matrix multiplies using graphics hardware

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Towards a theory of cache-efficient algorithms

Journal of the ACM (JACM)
Iteration Space Tiling for Memory Hierarchies

Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Photon mapping on programmable graphics hardware

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Simulation of cloud dynamics on graphics hardware

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Visual simulation of ice crystal growth

Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation
Linear algebra operators for GPU implementation of numerical algorithms

ACM SIGGRAPH 2003 Papers
Sparse matrix solvers on the GPU: conjugate gradients and multigrid

ACM SIGGRAPH 2003 Papers
Fast computation of database operations using graphics processors

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Shader algebra

ACM SIGGRAPH 2004 Papers
GPU Cluster for High Performance Computing

Proceedings of the 2004 ACM/IEEE conference on Supercomputing
UberFlow: a GPU-based particle engine

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Fast and approximate stream mining of quantiles and frequencies using graphics processors

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
GPUTeraSort: high performance graphics co-processor sorting for large database management

Proceedings of the 2006 ACM SIGMOD international conference on Management of data

EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Cache-efficient numerical algorithms using graphics hardware

Parallel Computing
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Efficient gather and scatter operations on graphics processors

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Locality-improved FFT implementation on a graphics processor

ISCGAV'07 Proceedings of the 7th WSEAS International Conference on Signal Processing, Computational Geometry & Artificial Vision
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
Efficient computation of sum-products on GPUs through software-managed cache

Proceedings of the 22nd annual international conference on Supercomputing
Technical Section: Accelerated MIP based on GPU using block clipping and occlusion query

Computers and Graphics
Using reconfigurable logic to optimise GPU memory accesses

Proceedings of the conference on Design, automation and test in Europe
Program optimization carving for GPU computing

Journal of Parallel and Distributed Computing
High performance discrete Fourier transforms on graphics processors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
All-pairs shortest-paths for large graphs on the GPU

Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
AES Encryption Implementation and Analysis on Commodity Graphics Processing Units

CHES '07 Proceedings of the 9th international workshop on Cryptographic Hardware and Embedded Systems
Evaluating Computational Performance of Backpropagation Learning on Graphics Hardware

Electronic Notes in Theoretical Computer Science (ENTCS)
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Memory Locality Exploitation Strategies for FFT on the CUDA Architecture

High Performance Computing for Computational Science - VECPAR 2008
Fast and scalable list ranking on the GPU

Proceedings of the 23rd international conference on Supercomputing
Experiences with Mapping Non-linear Memory Access Patterns into GPUs

ICCS '09 Proceedings of the 9th International Conference on Computational Science: Part I
Auto-tuning 3-D FFT library for CUDA GPUs

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Evaluating multicore algorithms on the unified memory model

Scientific Programming - Software Development for Multi-core Computing Systems
An adaptive performance modeling tool for GPU architectures

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Design and implementation of a graphical user interface for stream-based distributed computing

PDCN '08 Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks
Drug design issues on the cell BE

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
3D GPU architecture using cache stacking: performance, cost, power and thermal analysis

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
An empirically tuned 2D and 3D FFT library on CUDA GPU

Proceedings of the 24th ACM International Conference on Supercomputing
Algorithm engineering: bridging the gap between algorithm theory and practice

Algorithm engineering: bridging the gap between algorithm theory and practice
Database compression on graphics processors

Proceedings of the VLDB Endowment
Auto-tuning of fast fourier transform on graphics processors

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Analysis of Parallel Algorithms for Energy Conservation with GPU

GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
An idiom-finding tool for increasing productivity of accelerators

Proceedings of the international conference on Supercomputing
A systematic design space exploration approach to customising multi-processor architectures: exemplified using graphics processors

Transactions on High-Performance Embedded Architectures and Compilers IV
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Automatic restructuring of GPU kernels for exploiting inter-thread data locality

CC'12 Proceedings of the 21st international conference on Compiler Construction
GPU Performance Enhancement via Communication Cost Reduction: Case Studies of Radix Sort and WSN Relay Node Placement Problem

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
GPURoofline: a model for guiding performance optimizations on GPUs

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An insightful program performance tuning chain for GPU computing

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Energy cost evaluation of parallel algorithms for multiprocessor systems

Cluster Computing
Nonzero pattern analysis and memory access optimization in GPU-based sparse LU factorization for circuit simulation

IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Vectorized OpenCL implementation of numerical integration for higher order finite elements

Computers & Mathematics with Applications
A memory access model for highly-threaded many-core architectures

Future Generation Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C's model to analyze the cache misses. Moreover. we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications - sorting, fast Fourier transform and dense matrix-multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve 2-5 x performance improvement.