The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps of the standard eigenvalue problem and the singular value decomposition, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors because it cannot fully exploit thread-level parallelism. At the same time, the fine-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38-53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside the Basic Linear Algebra Subprograms (BLAS) but is brought to the fore to yield much better performance. Combined with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance, reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for two-sided transformations is that the full reduction cannot be obtained in one stage: the tile reductions produce band matrices, and other methods have to be considered to further reduce these band matrices to the required forms.
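To make the idea of data-driven execution of tile algorithms concrete, the following is a minimal sketch and not one of the three schedulers described in the paper. It assumes a one-sided tile Cholesky factorization rather than a Hessenberg or bidiagonal reduction, chosen only because it is the shortest tile algorithm to write down. Each tile is a single number, so the kernels reduce to scalar arithmetic. The names `Task`, `build_dependencies`, `run_dataflow`, and `tile_cholesky_tasks` are hypothetical helpers introduced for illustration. Dependencies are derived from the tiles each task reads and writes, and a task runs as soon as all of its predecessors have completed.

```python
import math
from collections import defaultdict, deque


class Task:
    """One kernel call plus the tiles it reads and writes (hypothetical helper)."""

    def __init__(self, name, func, reads, writes):
        self.name, self.func = name, func
        self.reads, self.writes = set(reads), set(writes)


def build_dependencies(tasks):
    """Derive task dependencies from tile accesses, in submission order."""
    last_writer = {}                      # tile -> index of the last task that wrote it
    readers = defaultdict(list)           # tile -> tasks that read it since that write
    deps = [set() for _ in tasks]
    for t, task in enumerate(tasks):
        for tile in task.reads:           # read-after-write
            if tile in last_writer:
                deps[t].add(last_writer[tile])
        for tile in task.writes:          # write-after-write and write-after-read
            if tile in last_writer:
                deps[t].add(last_writer[tile])
            deps[t].update(readers[tile])
        for tile in task.reads:
            readers[tile].append(t)
        for tile in task.writes:
            last_writer[tile] = t
            readers[tile] = []
    return deps


def run_dataflow(tasks, tiles):
    """Run each task once all its predecessors are done; return the execution order.

    Execution here is serial. A real runtime would hand ready tasks to worker threads.
    """
    deps = build_dependencies(tasks)
    remaining = [len(d) for d in deps]
    successors = defaultdict(list)
    for t, d in enumerate(deps):
        for p in d:
            successors[p].append(t)
    ready = deque(t for t, n in enumerate(remaining) if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t].func(tiles)
        order.append(tasks[t].name)
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return order


def tile_cholesky_tasks(n):
    """Task list for a right-looking tile Cholesky on an n x n grid of scalar tiles."""

    def potrf(k):
        def f(T): T[k, k] = math.sqrt(T[k, k])
        return Task(f"potrf({k})", f, [(k, k)], [(k, k)])

    def trsm(i, k):
        def f(T): T[i, k] = T[i, k] / T[k, k]
        return Task(f"trsm({i},{k})", f, [(k, k), (i, k)], [(i, k)])

    def update(i, j, k):
        def f(T): T[i, j] -= T[i, k] * T[j, k]
        return Task(f"update({i},{j},{k})", f, [(i, k), (j, k), (i, j)], [(i, j)])

    tasks = []
    for k in range(n):
        tasks.append(potrf(k))
        tasks.extend(trsm(i, k) for i in range(k + 1, n))
        tasks.extend(update(i, j, k) for i in range(k + 1, n) for j in range(k + 1, i + 1))
    return tasks


if __name__ == "__main__":
    A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
    tiles = {(i, j): A[i][j] for i in range(3) for j in range(i + 1)}  # lower triangle
    print(run_dataflow(tile_cholesky_tasks(3), tiles))
    print(tiles)  # now holds the lower-triangular Cholesky factor
```

In this sketch the updates to different tiles within one step have no edges between them, so a multithreaded runtime could execute them concurrently. In an actual tile algorithm each tile would be a square block of the matrix and each kernel a BLAS or LAPACK call on that block.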