Towards dense linear algebra for hybrid GPU accelerated manycore systems

Authors:
Stanimire Tomov;Jack Dongarra;Marc Baboulin
Affiliations:
University of Tennessee, Department of Electrical Engineering and Computer Science, 1122 Volunteer Blvd, Knoxville, TN 37996-3450, USA;University of Tennessee, Department of Electrical Engineering and Computer Science, 1122 Volunteer Blvd, Knoxville, TN 37996-3450, USA and Oak Ridge National Laboratory, USA and University of Manc ...;University of Tennessee, Department of Electrical Engineering and Computer Science, 1122 Volunteer Blvd, Knoxville, TN 37996-3450, USA and University of Coimbra, Department of Mathematics, 3001-45 ...
Venue:
Parallel Computing
Year:
2010

Citing 15
Cited 15

GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark

ACM Transactions on Mathematical Software (TOMS)
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
Accuracy and Stability of Numerical Algorithms

Accuracy and Stability of Numerical Algorithms
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (Gpu Gems)

GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (Gpu Gems)
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems)

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Larrabee: a many-core x86 architecture for visual computing

ACM SIGGRAPH 2008 papers
Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy

ACM Transactions on Mathematical Software (TOMS)
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Exploring New Architectures in Accelerating CFD for Air Force Applications

HPCMP-UGC '08 Proceedings of the 2008 DoD HPCMP Users Group Conference
Solving dense linear systems on platforms with multiple hardware accelerators

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Accelerating linpack with CUDA on heterogenous clusters

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
The impact of multicore on math software

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Prospectus for the next LAPACK and ScaLAPACK libraries

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing

Parallel Computing
Assessing the computational benefits of AREA-oriented DAG-scheduling

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Soft error resilient QR factorization for hybrid system with GPGPU

Proceedings of the second workshop on Scalable algorithms for large-scale systems
An implementation of the tile QR factorization for a GPU and multiple CPUs

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Sparse LU factorization for parallel circuit simulation on GPU

Proceedings of the 49th Annual Design Automation Conference
A Map-Reduce Based Framework for Heterogeneous Processing Element Cluster Environments

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems

SIAM Journal on Scientific Computing
CALU: A Communication Optimal LU Factorization Algorithm

SIAM Journal on Matrix Analysis and Applications
Towards distributed heterogenous high-performance computing with ViennaCL

LSSC'11 Proceedings of the 8th international conference on Large-Scale Scientific Computing
G-Charm: an adaptive runtime system for message-driven parallel applications on hybrid systems

Proceedings of the 27th international ACM conference on International conference on supercomputing
Glinda: a framework for accelerating imbalanced applications on heterogeneous platforms

Proceedings of the ACM International Conference on Computing Frontiers
On the Easy Use of Scientific Computing Services for Large Scale Linear Algebra and Parallel Decision Making with the P-Grade Portal

Journal of Grid Computing
CPU-GPU hybrid bidiagonal reduction with soft error resilience

ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Efficient 3D stencil computations using CUDA

Parallel Computing
A Novel CPU-GPU Cooperative Implementation of A Parallel Two-List Algorithm for the Subset-Sum Problem

Proceedings of Programming Models and Applications on Multicores and Manycores

Quantified Score

Hi-index	0.00

Visualization

Abstract

We highlight the trends leading to the increased appeal of using hybrid multicore+GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.