Algorithmic performance studies on graphics processing units

Authors:
Olaf Schenk;Matthias Christen;Helmar Burkhart
Affiliations:
Department of Computer Science, University of Basel, Klingelbergstrasse 50, CH-4056 Basel, Switzerland;Department of Computer Science, University of Basel, Klingelbergstrasse 50, CH-4056 Basel, Switzerland;Department of Computer Science, University of Basel, Klingelbergstrasse 50, CH-4056 Basel, Switzerland
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify matrix-matrix multiplication as a natural first entry point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800, a multicore chip originally designed for demanding gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. We demonstrate a prototype approach that leverages the bandwidth and computing power of GPUs for these matrix kernel operations, resulting in an overall performance of over 110 GFlop/s on the desktop for large matrices and over 38 GFlop/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.
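
The minimally invasive integration described in the abstract can be pictured as a thin dispatch layer around the matrix-matrix multiplication call: the solver itself stays unchanged, and only sufficiently large products are handed to the co-processor. The following is a minimal sketch in Python and not the authors' implementation. The names `gemm_cpu`, `make_dispatcher`, and `fake_accel` are hypothetical, the accelerator is a stand-in callable rather than a real GPU API, and the size threshold `min_flops` is an assumption that the abstract does not mention.

```python
from typing import Callable, List

Matrix = List[List[float]]
Gemm = Callable[[Matrix, Matrix], Matrix]


def gemm_cpu(a: Matrix, b: Matrix) -> Matrix:
    """Reference host implementation: returns the product a @ b."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def make_dispatcher(accel_gemm: Gemm, min_flops: int = 1_000_000) -> Gemm:
    """Build a GEMM that offloads only products above a size threshold.

    min_flops is a hypothetical tuning knob (an assumption of this sketch):
    small products stay on the host, where no data transfer is needed.
    """
    def gemm(a: Matrix, b: Matrix) -> Matrix:
        flops = 2 * len(a) * len(b) * len(b[0])  # multiplies plus adds in a @ b
        backend = accel_gemm if flops >= min_flops else gemm_cpu
        return backend(a, b)
    return gemm


# Usage: a stand-in "accelerator" that records its calls and reuses the host code.
offloaded = []


def fake_accel(a: Matrix, b: Matrix) -> Matrix:
    offloaded.append((len(a), len(b), len(b[0])))
    return gemm_cpu(a, b)


gemm = make_dispatcher(fake_accel, min_flops=16)
small = gemm([[1.0, 2.0]], [[3.0], [4.0]])   # 4 flops: stays on the host
large = gemm([[1.0, 0.0], [0.0, 1.0]],
             [[5.0, 6.0], [7.0, 8.0]])       # 16 flops: offloaded
```

Because the calling code only sees a function with the same signature as the host routine, a backend can be swapped in without touching the factorization or optimization logic, which is the sense in which such an integration is minimally invasive.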