We present recursive blocked algorithms for solving triangular Sylvester-type matrix equations. Recursion leads to automatic blocking that is variable and "squarish". Most of the computation is performed as level 3 general matrix multiply and add (GEMM) operations. We also present new, highly optimized superscalar kernels for solving small matrix equations that reside in level 1 cache. As a result, a larger share of the total execution time is spent in GEMM operations. In turn, this leads to much better performance, especially for small to medium-sized problems, and to improved parallel efficiency on shared memory processor (SMP) systems. Uniprocessor and SMP parallel performance results are presented and compared with results from existing LAPACK routines for solving this type of matrix equation.
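The recursive blocking idea can be illustrated with a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes the equation has the form AX + XB = C with A and B strictly upper triangular (no 2x2 diagonal blocks) and with no eigenvalue of A equal to the negative of an eigenvalue of B. The names `rtrsyl`, `solve_small`, and `blk` are hypothetical, and the dense Kronecker solve only stands in for the optimized level 1 cache kernels mentioned above.

```python
import numpy as np


def solve_small(A, B, C):
    """Stand-in kernel (assumption): solve A X + X B = C for a small block."""
    m, n = C.shape
    # vec(A X + X B) = (I_n kron A + B^T kron I_m) vec(X), column-major vec.
    K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
    x = np.linalg.solve(K, C.flatten(order="F"))
    return x.reshape((m, n), order="F")


def rtrsyl(A, B, C, blk=4):
    """Recursively solve A X + X B = C with A, B upper triangular (sketch)."""
    m, n = C.shape
    if max(m, n) <= blk:
        # Problem is small enough for the kernel.
        return solve_small(A, B, C)
    if m >= n:
        # Split the larger dimension (rows): A = [[A11, A12], [0, A22]].
        h = m // 2
        X2 = rtrsyl(A[h:, h:], B, C[h:, :], blk)
        # GEMM update of the top right-hand side, then solve for the top rows.
        X1 = rtrsyl(A[:h, :h], B, C[:h, :] - A[:h, h:] @ X2, blk)
        return np.vstack([X1, X2])
    # Otherwise split the columns: B = [[B11, B12], [0, B22]].
    h = n // 2
    X1 = rtrsyl(A, B[:h, :h], C[:, :h], blk)
    # GEMM update of the right right-hand side, then solve for the right columns.
    X2 = rtrsyl(A, B[h:, h:], C[:, h:] - X1 @ B[:h, h:], blk)
    return np.hstack([X1, X2])
```

Splitting the larger dimension in half at each level keeps the subproblems roughly square, and the only work outside the leaf kernel is the two matrix products, which is where a GEMM-based implementation would spend most of its time.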