Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

Authors:
Fengguang Song;Hatem Ltaief;Bilel Hadri;Jack Dongarra
Affiliations:
-;-;-;-
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Citing 7
Cited 7

Distributed orthogonal factorization: givens and householder algorithms

SIAM Journal on Scientific and Statistical Computing
LAPACK's user's guide

LAPACK's user's guide
ScaLAPACK user's guide

ScaLAPACK user's guide
ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Parallel tiled QR factorization for multicore architectures

Concurrency and Computation: Practice & Experience
A class of parallel tiled linear algebra algorithms for multicore architectures

Parallel Computing
Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Distributed QR factorization based on randomized algorithms

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
A framework for low-communication 1-D FFT

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hierarchical QR factorization algorithms for multi-core clusters

Parallel Computing
Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scalable matrix decompositions with multiple cores on FPGAs

Microprocessors & Microsystems
A framework for low-communication 1-D FFT

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.00

Visualization

Abstract

As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication- Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communicationavoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.