Concurrent number cruncher: a GPU implementation of a general sparse linear solver

Authors:
Luc Buatois; Guillaume Caumon; Bruno Levy
Affiliations:
ENSG/CRPG, Gocad Research Group, Nancy University, Vandoeuvre-les-Nancy, France, and INRIA Lorraine, ALICE, Vandoeuvre-les-Nancy Cedex, France; ENSG/CRPG, Gocad Research Group, Nancy University, Vandoeuvre-les-Nancy, France; INRIA Lorraine, ALICE, Vandoeuvre-les-Nancy Cedex, France
Venue:
International Journal of Parallel, Emergent and Distributed Systems
Year:
2009

Abstract

A wide class of numerical methods needs to solve a linear system in which the pattern of non-zero matrix coefficients can be arbitrary. Such problems can benefit greatly from the highly multithreaded computational power and large memory bandwidth available on graphics processing units (GPUs), especially since dedicated general-purpose APIs such as close-to-metal (CTM) (AMD-ATI) and compute unified device architecture (CUDA) (NVIDIA) have appeared. CUDA even provides a BLAS implementation (CuBLAS), but only for dense matrices. Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high-performance computing strategies (namely block compressed row storage (BCRS), register blocking and vectorization) to implement a general-purpose sparse linear solver. Our implementation of the Jacobi-preconditioned conjugate gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0, making it attractive for applications that are content with single precision.
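
For readers unfamiliar with the algorithm named in the abstract, the following is a minimal CPU-side sketch of a Jacobi-preconditioned conjugate gradient solver over a sparse matrix. It is an illustration under stated assumptions, not the paper's GPU implementation: it is written in plain Python, it uses scalar compressed row storage (CRS) rather than the block variant (BCRS), and the function names (`csr_matvec`, `csr_diagonal`, `jacobi_pcg`) and the small test matrix are hypothetical. The matrix is assumed to be symmetric positive definite with a non-zero diagonal.

```python
from math import sqrt

def csr_matvec(row_ptr, col_idx, values, x):
    """Return y = A x for a matrix A held in compressed row storage."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        # Non-zeros of row i occupy values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

def csr_diagonal(row_ptr, col_idx, values):
    """Extract the diagonal of A; it defines the Jacobi preconditioner."""
    n = len(row_ptr) - 1
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                diag[i] = values[k]
    return diag

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def jacobi_pcg(row_ptr, col_idx, values, b, tol=1e-10, max_iter=1000):
    """Solve A x = b (A assumed symmetric positive definite)."""
    n = len(b)
    inv_diag = [1.0 / d for d in csr_diagonal(row_ptr, col_idx, values)]
    x = [0.0] * n
    r = list(b)                                   # residual b - A x with x = 0
    z = [m * ri for m, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    b_norm = sqrt(dot(b, b)) or 1.0
    for _ in range(max_iter):
        if sqrt(dot(r, r)) <= tol * b_norm:       # relative residual test
            break
        Ap = csr_matvec(row_ptr, col_idx, values, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Hypothetical 3x3 test matrix [[4, -1, 0], [-1, 4, -1], [0, -1, 4]] in CRS.
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [2.0, 4.0, 10.0]          # equals A applied to [1, 2, 3]
solution = jacobi_pcg(row_ptr, col_idx, values, b)
```

Each iteration is dominated by one sparse matrix-vector product plus a handful of dot products and vector updates, all of which are data-parallel; this is the kind of workload the abstract argues maps well onto GPU bandwidth. A BCRS layout would store small dense blocks in place of the scalar entries of `values`, which is where register blocking and vectorization come into play.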