QR factorization for the Cell Broadband Engine

Authors:
Jakub Kurzak; Jack Dongarra
Affiliations:
(Corresponding author. E-mail: kurzak@eecs.utk.edu) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA; Dept. of Elec. Eng. and Comp. Sci., Univ. of Tennessee, Knoxville, TN, USA and Comp. Sci. and Mathematics Div., Oak Ridge National Lab., Oak Ridge, TN, USA and Sch. of Math. and Sch. of Comp. Sci. ...
Venue:
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Year:
2009

Abstract

The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations, including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations because the panel factorization phase relies on matrix-vector type operations. These limitations can be remedied by using the idea of updating the QR factorization, which yields an algorithm that is much more scalable and much better suited to implementation on a multi-core processor. It is demonstrated how the potential of the Cell Broadband Engine can be fully utilized by employing the new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single instruction, multiple data (SIMD) parallelism, instruction-level parallelism and thread-level parallelism.
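
The idea of updating a QR factorization can be illustrated with a minimal sketch. It is plain Python written for this summary, not the Cell implementation described in the paper. It assumes a tall matrix split into row tiles, with the first tile holding at least as many rows as columns. All function names (`householder_r`, `updated_r`, `gram`) are hypothetical. The sketch factors the first tile, then repeatedly stacks the current R on top of the next tile and refactors the stack. A real tile algorithm would also exploit the triangular structure of R and apply the transformations to trailing tiles, which is omitted here.

```python
import math


def householder_r(a):
    """Return the n x n upper-triangular factor R of a (list of rows, m >= n)."""
    a = [row[:] for row in a]          # work on a copy
    m, n = len(a), len(a[0])
    for k in range(n):
        # Norm of the part of column k on and below the diagonal.
        norm = math.sqrt(sum(a[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue                   # column already zero, nothing to annihilate
        alpha = -math.copysign(norm, a[k][k])
        # Householder vector v = x - alpha * e1.
        v = [a[i][k] for i in range(k, m)]
        v[0] -= alpha
        vnorm2 = sum(x * x for x in v)
        # Apply the reflector I - 2 v v^T / (v^T v) to columns k..n-1.
        for j in range(k, n):
            dot = sum(v[i] * a[k + i][j] for i in range(len(v)))
            scale = 2.0 * dot / vnorm2
            for i in range(len(v)):
                a[k + i][j] -= scale * v[i]
    # Keep only the upper triangle of the leading n rows.
    return [[a[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def updated_r(tiles):
    """Compute R of the stacked tiles by updating R one tile at a time."""
    r = householder_r(tiles[0])
    for tile in tiles[1:]:
        r = householder_r(r + tile)    # factor the stack [R; next tile]
    return r


def gram(a):
    """Return a^T a, used to compare R factors independently of row signs."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    tiles = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(updated_r(tiles))
```

Because R is unique only up to the signs of its rows, the sketch is best checked by comparing `gram(updated_r(tiles))` with the Gram matrix of the full stacked matrix, rather than comparing entries of R directly.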