Applying recursion to serial and parallel QR factorization leads to better performance

Authors:
E. Elmroth; F. G. Gustavson
Affiliations:
Department of Computing Science and High Performance Computing Center North, Umeå University, Umeå, Sweden; IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York
Venue:
IBM Journal of Research and Development
Year:
2000



Abstract

We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix that improve performance over standard block algorithms. The recursion leads to an automatic variable blocking, and it also replaces a Level 2 part in a standard block algorithm with Level 3 operations. However, there are significant additional costs for creating and performing the updates, which prohibit the efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and by up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000® SP nodes: a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC® 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation on a four-way 332-MHz IBM PPC604e SMP node, based on dynamic load balancing, is presented. For two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97, respectively.
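The recursive idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `householder` and `recursive_qr`, the choice to split at `n // 2`, and the use of the compact WY form Q = I - Y T Yᵀ are assumptions made for illustration. The sketch factors the left half of the columns recursively, applies the resulting block reflector to the right half with matrix-matrix products, factors the remaining lower-right part recursively, and merges the two triangular factors T1 and T2.

```python
import numpy as np

def householder(x):
    """Return (v, tau, beta) with v[0] == 1 and (I - tau*v*v^T) x = beta*e1."""
    x = np.asarray(x, dtype=float)
    alpha = x[0]
    normx = np.linalg.norm(x)
    v = np.zeros_like(x)
    v[0] = 1.0
    if normx == 0.0:
        return v, 0.0, 0.0            # zero column: reflector is the identity
    beta = -np.copysign(normx, alpha)  # opposite sign avoids cancellation
    v[1:] = x[1:] / (alpha - beta)
    tau = (beta - alpha) / beta
    return v, tau, beta

def recursive_qr(A):
    """Recursive QR of an m x n matrix (m >= n), illustrative only.

    Returns Y (m x n), T (n x n upper triangular) and R (n x n upper
    triangular) such that Q = I - Y @ T @ Y.T and Q.T @ A = [R; 0].
    """
    A = np.array(A, dtype=float)
    m, n = A.shape
    if n == 1:
        v, tau, beta = householder(A[:, 0])
        return v.reshape(m, 1), np.array([[tau]]), np.array([[beta]])

    n1 = n // 2
    n2 = n - n1
    # Factor the left half of the columns.
    Y1, T1, R11 = recursive_qr(A[:, :n1])
    # Apply Q1^T to the right half using matrix-matrix products only.
    A2 = A[:, n1:] - Y1 @ (T1.T @ (Y1.T @ A[:, n1:]))
    R12 = A2[:n1, :]
    # Factor the remaining lower-right part.
    Y2_low, T2, R22 = recursive_qr(A2[n1:, :])
    Y2 = np.vstack([np.zeros((n1, n2)), Y2_low])
    # Merge the two block reflectors: T = [[T1, T3], [0, T2]].
    T3 = -T1 @ (Y1.T @ Y2) @ T2
    Y = np.hstack([Y1, Y2])
    T = np.block([[T1, T3], [np.zeros((n2, n1)), T2]])
    R = np.block([[R11, R12], [np.zeros((n2, n1)), R22]])
    return Y, T, R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 5))
    Y, T, R = recursive_qr(A)
    Q = np.eye(8) - Y @ T @ Y.T
    print(np.allclose(Q[:, :5] @ R, A))
```

Forming `T3` and applying the block reflector to the right half are extra work relative to an unblocked factorization, and that work grows with the number of columns. This is plausibly the kind of additional update cost the abstract refers to when it limits pure recursion to moderate n, although the sketch makes no attempt to reproduce the paper's cost analysis or its hybrid variant.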