Hierarchical QR factorization algorithms for multi-core clusters

Authors:
Jack Dongarra;Mathieu Faverge;Thomas Hérault;Mathias Jacquelin;Julien Langou;Yves Robert
Affiliations:
University of Tennessee Knoxville 1122 Volunteer Blvd, Knoxville, TN 37996, USA and Oak Ridge National Laboratory 1 Bethel Valley Rd, Oak Ridge, TN 37831, USA and Manchester University, UK School ...;University of Tennessee Knoxville 1122 Volunteer Blvd, Knoxville, TN 37996, USA;University of Tennessee Knoxville 1122 Volunteer Blvd, Knoxville, TN 37996, USA;INRIA Saclay Campus de l'École Polytechnique, 91120 Palaiseau, France;University of Colorado Denver PO Box 173364, Denver, CO 80217-3364, USA;University of Tennessee Knoxville 1122 Volunteer Blvd, Knoxville, TN 37996, USA and Ecole Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France
Venue:
Parallel Computing
Year:
2013

Abstract

This paper describes a new QR factorization algorithm designed for massively parallel platforms that combine parallel distributed nodes, where each node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm belongs to the category of tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (a.k.a. "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low-level" for decoupled highly parallel inter-node reductions, (2) "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree (the three intra-node levels plus the inter-node tree) contributes to performance and (ii) provide insight into how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
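
To make the idea of a hierarchical reduction tree concrete, here is a minimal sketch for a single panel (one column of tiles). It is not the authors' implementation, and it models only two of the levels named above: an intra-node tree and an inter-node tree. The function names, the choice of a flat tree inside each node and a binary tree across nodes, and the cyclic distribution of tile rows over nodes are all illustrative assumptions. Each elimination is written as a pair (eliminated row, pivot row), and the sketch counts how many eliminations involve two different nodes.

```python
def flat_tree(rows):
    """Eliminate every row after the first, using the first row as pivot."""
    pivot = rows[0]
    return [(r, pivot) for r in rows[1:]]


def binary_tree(rows):
    """Pair up rows in rounds; the survivor of each pair advances."""
    eliminations = []
    alive = list(rows)
    while len(alive) > 1:
        survivors = []
        for i in range(0, len(alive), 2):
            survivors.append(alive[i])
            if i + 1 < len(alive):
                eliminations.append((alive[i + 1], alive[i]))
        alive = survivors
    return eliminations


def owner(row, num_nodes):
    """Hypothetical 1D cyclic distribution of tile rows over nodes."""
    return row % num_nodes


def hierarchical_panel(num_tile_rows, num_nodes):
    """Intra-node flat trees first, then an inter-node binary tree."""
    intra, heads = [], []
    for p in range(num_nodes):
        local = [r for r in range(num_tile_rows) if owner(r, num_nodes) == p]
        if local:
            intra += flat_tree(local)   # reductions that stay inside node p
            heads.append(local[0])      # one surviving row per node
    inter = binary_tree(sorted(heads))  # reductions between nodes
    return intra, inter


def cross_node(eliminations, num_nodes):
    """Count eliminations whose two rows live on different nodes."""
    return sum(owner(r, num_nodes) != owner(p, num_nodes)
               for r, p in eliminations)


if __name__ == "__main__":
    rows, nodes = 8, 4
    intra, inter = hierarchical_panel(rows, nodes)
    naive = flat_tree(list(range(rows)))
    print("hierarchical cross-node eliminations:",
          cross_node(intra + inter, nodes))
    print("single flat tree cross-node eliminations:",
          cross_node(naive, nodes))
```

Under these assumptions, both schemes perform the same number of eliminations for the panel (one fewer than the number of tile rows), but the hierarchical scheme confines cross-node eliminations to the inter-node tree, which is the intuition behind the "communication-avoiding" label.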