We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable, high-performance algorithms for these emerging systems stem from their heterogeneity, their massive parallelism, and the huge gap between the GPUs' compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures already developed for homogeneous multicores and hybrid GPU-based computing. The result is a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using an NVIDIA Tesla S1070 (4 C1060 GPUs, each with 30 cores @ 1.44 GHz) connected to two dual-core AMD Opteron processors @ 1.8 GHz, we reach up to 1.163 TFlop/s in single precision and up to 275 GFlop/s in double precision arithmetic. Compared with the performance of the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs is involved, our algorithm still runs at 73% and 84% of that rate for single and double precision arithmetic, respectively.
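To make the hybrid idea concrete, below is a minimal single-GPU sketch in C using cuBLAS and LAPACKE: the small, latency-bound diagonal block is factored on the CPU with LAPACK, while the large, bandwidth-rich BLAS-3 updates (SYRK, GEMM, TRSM) of the trailing matrix run on the GPU. This is an illustration under assumed details, not the paper's implementation: the function name hybrid_dpotrf, the block size NB, the left-looking variant, and the single-GPU simplification are all assumptions; the paper's algorithm additionally distributes work across four GPUs and overlaps CPU-GPU communication with computation.

```c
/* Hypothetical single-GPU sketch of a hybrid CPU+GPU blocked Cholesky
 * (lower triangular, column-major, left-looking variant).
 * Build (assumed): nvcc -o chol chol.c -lcublas -llapacke            */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <lapacke.h>

#define NB 128  /* assumed block size; tuned per GPU in practice */

/* Factor the n-by-n matrix dA (device pointer, leading dim ldda). */
void hybrid_dpotrf(cublasHandle_t h, int n, double *dA, int ldda)
{
    double *panel = malloc((size_t)NB * NB * sizeof *panel);
    const double one = 1.0, mone = -1.0;

    for (int j = 0; j < n; j += NB) {
        int jb = (n - j < NB) ? n - j : NB;

        /* GPU: update the diagonal block with already-factored columns,
         * A(j,j) -= A(j,0:j) * A(j,0:j)^T  (rank-j SYRK). */
        if (j > 0)
            cublasDsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        jb, j, &mone, dA + j, ldda,
                        &one, dA + j + (size_t)j * ldda, ldda);

        /* CPU: copy the jb-by-jb diagonal block down, factor it with
         * LAPACK, and copy the result back to the GPU. */
        cublasGetMatrix(jb, jb, sizeof(double),
                        dA + j + (size_t)j * ldda, ldda, panel, NB);
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, panel, NB);
        cublasSetMatrix(jb, jb, sizeof(double), panel, NB,
                        dA + j + (size_t)j * ldda, ldda);

        if (j + jb < n) {
            int m = n - j - jb;
            /* GPU: update the sub-diagonal panel, then triangular-solve
             * for it: A(j+jb:n,j) = (A(j+jb:n,j) - ...) * L(j,j)^{-T}. */
            if (j > 0)
                cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, m, jb, j,
                            &mone, dA + j + jb, ldda, dA + j, ldda,
                            &one, dA + j + jb + (size_t)j * ldda, ldda);
            cublasDtrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m, jb,
                        &one, dA + j + (size_t)j * ldda, ldda,
                        dA + j + jb + (size_t)j * ldda, ldda);
        }
    }
    free(panel);
}
```

The split reflects the design point the abstract makes: the O(NB^3) diagonal factorization is too small to amortize GPU kernel launches, so it stays on the CPU, while the O(n^2 NB) trailing updates dominate the flop count and map well onto the GPU's BLAS-3 throughput. Scaling this to multiple GPUs, as in the paper, then hinges on hiding the comparatively slow CPU-GPU transfers behind those updates.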