The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations, including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations because the panel factorization phase relies on matrix-vector operations. These limitations can be remedied by applying the idea of updating the QR factorization, which yields an algorithm that is much more scalable and much better suited to implementation on a multi-core processor. It is demonstrated how the potential of the Cell Broadband Engine can be fully utilized by employing the new algorithmic approach and exploiting the chip's capabilities for single-instruction multiple-data (SIMD) parallelism, instruction-level parallelism, and thread-level parallelism.
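
The idea of updating a QR factorization can be illustrated with a minimal sketch: the R factor is built one block of rows at a time, and each new block is folded into the current R instead of factoring a full panel at once. This sketch is illustrative only and makes several assumptions not taken from the abstract. It uses Givens rotations in place of the blocked Householder kernels that a high-performance implementation would use, it partitions the matrix by row blocks only, and it does not accumulate Q. The function names `givens_update` and `tiled_r_factor` are hypothetical.

```python
import math


def givens_update(R, rows):
    """Return the R factor of the stacked matrix [R; rows].

    R is an n x n upper-triangular matrix (list of lists) and rows is a
    list of length-n rows. Givens rotations stand in for the Householder
    kernels of a real implementation.
    """
    n = len(R)
    R = [row[:] for row in R]          # work on a copy
    for b in rows:
        b = b[:]
        for j in range(n):
            if b[j] == 0.0:
                continue               # entry is already zero
            # Rotation that zeroes b[j] against the diagonal entry R[j][j].
            r = math.hypot(R[j][j], b[j])
            c, s = R[j][j] / r, b[j] / r
            for k in range(j, n):
                rjk, bk = R[j][k], b[k]
                R[j][k] = c * rjk + s * bk
                b[k] = -s * rjk + c * bk
    return R


def tiled_r_factor(A, tile_rows):
    """Compute the R factor of A by folding in tile_rows rows at a time."""
    n = len(A[0])
    R = [[0.0] * n for _ in range(n)]  # R factor of an empty matrix
    for start in range(0, len(A), tile_rows):
        R = givens_update(R, A[start:start + tile_rows])
    return R
```

Because each update touches only the current R and one block of rows, the work splits into small independent-sized tasks, which is the property that makes the approach attractive for multi-core scheduling. A quick sanity check is that the transpose of R times R equals the transpose of A times A, up to rounding.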