The goal of this paper is to present an efficient implementation of explicit matrix inversion of general square matrices on multicore architectures. The inversion procedure is split into four steps: (1) computing the LU factorization, (2) inverting the upper triangular factor U, (3) solving a linear system whose solution yields the inverse of the original matrix, and (4) applying backward column pivoting to the inverted matrix. Using a tile data layout, which stores the matrix in system memory in an optimized cache-aware format, the computation of the four steps is decomposed into fine-grained tasks. A directed acyclic graph representing the program data flow is generated on the fly; its nodes represent tasks and its edges the data dependencies between them. Previous implementations of matrix inversion available in state-of-the-art numerical libraries suffer from unnecessary synchronization points that prevent the parallelism of the underlying hardware from being fully exploited. Our algorithmic approach removes these bottlenecks and executes the tasks with loose synchronization. A runtime system called QUARK dynamically schedules the numerical kernels on the available processing units. The reported results of our LU-based matrix inversion implementation significantly outperform state-of-the-art numerical libraries such as LAPACK (5x), MKL (5x) and ScaLAPACK (2.5x) on a contemporary AMD platform with four sockets and a total of 48 cores, for a matrix of size 24000. A power consumption analysis shows that our high-performance implementation is also energy efficient and consumes substantially less power than its competitors.
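For readers unfamiliar with the four-step procedure, the following C sketch shows the conventional (non-tiled) baseline it refines, expressed through the standard LAPACKE interface. This is an illustrative assumption only, not the paper's tile-based, DAG-scheduled implementation: dgetrf performs step 1, while dgetri internally carries out steps 2 through 4 (inverting U, solving for the inverse using L, and applying the recorded column interchanges). The helper name invert_inplace and the 2x2 test matrix are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

/* Explicit in-place inversion of an n-by-n matrix A via LU factorization.
 * Returns 0 on success, >0 if A is singular, <0 on allocation failure. */
int invert_inplace(double *a, lapack_int n)
{
    lapack_int *ipiv = malloc((size_t)n * sizeof(*ipiv));
    if (!ipiv) return -1;

    /* Step 1: LU factorization with partial pivoting, A = P*L*U. */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);

    /* Steps 2-4: invert U, solve for inv(A), apply column pivoting. */
    if (info == 0)
        info = LAPACKE_dgetri(LAPACK_ROW_MAJOR, n, a, n, ipiv);

    free(ipiv);
    return (int)info;
}

int main(void)
{
    double a[4] = { 4.0, 3.0,
                    6.0, 3.0 };   /* small 2x2 example matrix */
    if (invert_inplace(a, 2) == 0)
        printf("inv(A) = [% .3f % .3f; % .3f % .3f]\n",
               a[0], a[1], a[2], a[3]);
    return 0;
}
```

In this baseline each step runs as a monolithic, internally synchronized routine; the approach described in the abstract instead breaks the same four steps into tile-level tasks whose dependencies form the DAG scheduled by QUARK.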