Accurate singular values of bidiagonal matrices
SIAM Journal on Scientific and Statistical Computing
The bidiagonal singular value decomposition and Hamiltonian mechanics
SIAM Journal on Numerical Analysis
A Parallel Algorithm for Computing the Singular Value Decomposition of a Matrix
SIAM Journal on Matrix Analysis and Applications
A Divide-and-Conquer Algorithm for the Bidiagonal SVD
SIAM Journal on Matrix Analysis and Applications
Matrix computations (3rd ed.)
ScaLAPACK Users' Guide
Unitary Triangularization of a Nonsymmetric Matrix
Journal of the ACM (JACM)
LAPACK Users' guide (third ed.)
Efficient parallel reduction to bidiagonal form
Parallel Computing
Algorithm 807: The SBR Toolbox—software for successive band reduction
ACM Transactions on Mathematical Software (TOMS)
The Decompositional Approach to Matrix Computation
Computing in Science and Engineering
Evaluating Block Algorithm Variants in LAPACK
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Design and Evaluation of Parallel Block Algorithms: LU Factorization on an IBM 3090 VF/600J
Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing
New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms
Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
Automatic blocking of QR and LU factorizations for locality
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Block and Parallel Versions of One-Sided Bidiagonalization
SIAM Journal on Matrix Analysis and Applications
Parallel tiled QR factorization for multicore architectures
Concurrency and Computation: Practice & Experience
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Comparative study of one-sided factorizations with multiple software packages on multi-core hardware
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures
IEEE Transactions on Parallel and Distributed Systems
The impact of multicore on math software
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Computer Architecture, Fifth Edition: A Quantitative Approach
IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Concurrency and Computation: Practice & Experience
Computer Science - Research and Development
Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
Proceedings of the 27th international ACM conference on International conference on supercomputing
An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. It extends the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation in the first stage (reduction to band form) into calls to Level 3 BLAS and reduces memory traffic in the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout onto the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that data dependencies are not violated. A detailed analysis explains the critical impact on total execution time of the tile size, which also determines the matrix bandwidth after the first-stage reduction. The performance results show a significant improvement over currently established alternatives: the new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix over state-of-the-art open-source and commercial numerical software packages, namely LAPACK (compiled with optimized, multithreaded BLAS from MKL) and Intel MKL version 10.2.
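The tile data layout of feature (1) — storing the matrix as small contiguous square blocks rather than in column-major order — can be illustrated with a minimal sketch. This is not the authors' implementation (which operates in compiled kernels scheduled by a runtime system); the function names and the use of NumPy here are illustrative assumptions only.

```python
import numpy as np

def to_tile_layout(A, nb):
    """Copy a matrix into contiguous nb-by-nb tiles.

    Illustrative sketch only: returns a dict mapping tile coordinates
    (ti, tj) to contiguous tile buffers. Border tiles may be ragged
    when nb does not divide the matrix dimensions.
    """
    m, n = A.shape
    tiles = {}
    for i in range(0, m, nb):
        for j in range(0, n, nb):
            # each tile is stored contiguously, which is what improves
            # cache reuse for the blocked kernels operating on it
            tiles[(i // nb, j // nb)] = np.ascontiguousarray(A[i:i + nb, j:j + nb])
    return tiles

def from_tile_layout(tiles, m, n, nb):
    """Reassemble the full matrix from its tiles (inverse mapping)."""
    A = np.empty((m, n))
    for (ti, tj), tile in tiles.items():
        i, j = ti * nb, tj * nb
        A[i:i + tile.shape[0], j:j + tile.shape[1]] = tile
    return A

# round-trip check: 6x6 matrix, 4x4 tiles (exercises ragged border tiles)
A = np.arange(36.0).reshape(6, 6)
tiles = to_tile_layout(A, 4)
assert np.array_equal(from_tile_layout(tiles, 6, 6, 4), A)
```

The translation layer of feature (3) plays the role of the two mapping functions above: it lets an algorithm expressed against column-major storage execute on data held in tile layout. The tile size nb is the tuning parameter the article's analysis focuses on, since it also fixes the bandwidth of the intermediate band form.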