Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures

Authors:
Gregorio Quintana-Ortí; Enrique S. Quintana-Ortí; Ernie Chan; Robert A. van de Geijn; Field G. Van Zee
Venue:
PDP '08 Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
Year:
2008

Abstract

This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data and, likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK, but expressed in a way that allows operations on the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system uses FLASH to assemble and represent matrices, and also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors and an SMP architecture with 16 cores.
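
The idea of treating block operations as tasks that run once their dependencies are satisfied can be illustrated with a small, purely symbolic sketch. It does not use the FLAME/FLASH API or the SuperMatrix run-time system, and it performs no numerical work. The task names (QR, APPLY, QR2, APPLY2), the function names, and the wave-by-wave scheduling loop are all assumptions made for illustration; the loop structure follows the commonly described pattern for an incremental QR factorization by blocks, which may differ in detail from the algorithms in the paper.

```python
from collections import defaultdict


def tile_qr_tasks(nb):
    """List the block operations of a QR factorization by blocks.

    The matrix is viewed as an nb x nb grid of blocks. Each task is a tuple
    (name, reads, writes), where reads and writes are sets of block
    coordinates (i, j). Task names are hypothetical labels, not library calls.
    """
    tasks = []
    for k in range(nb):
        # Factor the diagonal block (k, k).
        tasks.append((f"QR({k},{k})", set(), {(k, k)}))
        # Apply its Householder transformations to the blocks to its right.
        for j in range(k + 1, nb):
            tasks.append((f"APPLY({k},{j})", {(k, k)}, {(k, j)}))
        for i in range(k + 1, nb):
            # Annihilate block (i, k) against the triangular block (k, k).
            tasks.append((f"QR2({i},{k})", set(), {(k, k), (i, k)}))
            # Update the matching pair of blocks in each trailing column.
            for j in range(k + 1, nb):
                tasks.append(
                    (f"APPLY2({i},{j};{k})", {(i, k)}, {(k, j), (i, j)})
                )
    return tasks


def build_dag(tasks):
    """Derive dependencies from the blocks each task reads and writes.

    A task depends on the last writer of every block it touches, and on all
    readers of a block since its last write if the task overwrites it.
    """
    last_writer = {}
    readers = defaultdict(list)
    deps = [set() for _ in tasks]
    for t, (_, reads, writes) in enumerate(tasks):
        for block in reads | writes:
            if block in last_writer:
                deps[t].add(last_writer[block])
        for block in writes:
            deps[t].update(readers[block])  # wait for pending readers
            readers[block] = []
            last_writer[block] = t
        for block in reads:
            readers[block].append(t)
    return deps


def schedule_in_waves(tasks, deps):
    """Group tasks into waves; every task in a wave has all dependencies done.

    This is a simplified stand-in for out-of-order scheduling: a real
    run-time system would dispatch each task as soon as it becomes ready
    instead of waiting for a whole wave to finish.
    """
    done, waves = set(), []
    while len(done) < len(tasks):
        ready = [
            t for t in range(len(tasks))
            if t not in done and deps[t] <= done
        ]
        waves.append([tasks[t][0] for t in ready])
        done.update(ready)
    return waves


if __name__ == "__main__":
    tasks = tile_qr_tasks(3)
    for number, wave in enumerate(schedule_in_waves(tasks, build_dag(tasks))):
        print(number, wave)
```

Running the script prints the waves for a 3 x 3 grid of blocks. Tasks that share a wave touch disjoint blocks or only read a common block, so under these assumptions they could be executed concurrently by different threads.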