The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges, starting with our algorithm design, continuing through kernel optimization and tuning, and finishing with our programming model. Together, these lead to the development of a scalable, high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grained, cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speedup when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
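To make the "task-based approach" mentioned in the abstract more concrete, the following is a minimal sketch of how a runtime might turn a sequence of fine-grained tile kernels into a dependency graph and group them into sets that could run concurrently. It is not the authors' runtime. It assumes that dependencies are inferred from which tiles each task reads and writes, which the abstract does not state, and the task names (`factor_A00`, `update_A01`, and so on) and tile labels are hypothetical placeholders rather than the solver's actual kernels.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_task_graph(tasks):
    """Infer dependencies from data hazards.

    tasks: list of (name, reads, writes) tuples in program order, where
    reads/writes are lists of tile labels (hypothetical names).
    Returns a dict mapping each task name to the set of tasks it must wait for.
    """
    deps = {name: set() for name, _, _ in tasks}
    last_writer = {}  # tile -> most recent task that wrote it
    readers = {}      # tile -> tasks that read it since that write
    for name, reads, writes in tasks:
        for tile in reads:
            if tile in last_writer:
                deps[name].add(last_writer[tile])      # read-after-write
        for tile in writes:
            if tile in last_writer:
                deps[name].add(last_writer[tile])      # write-after-write
            deps[name].update(readers.get(tile, ()))   # write-after-read
        # Record this task's accesses only after its own hazards are resolved.
        for tile in reads:
            readers.setdefault(tile, set()).add(name)
        for tile in writes:
            last_writer[tile] = name
            readers[tile] = set()
    return deps


def schedule_waves(deps):
    """Group tasks into waves; tasks within one wave are mutually independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical kernels operating on a 2 x 2 grid of tiles A00, A01, A10, A11.
tasks = [
    ("factor_A00", ["A00"], ["A00"]),
    ("update_A01", ["A00", "A01"], ["A01"]),
    ("update_A10", ["A00", "A10"], ["A10"]),
    ("update_A11", ["A01", "A10", "A11"], ["A11"]),
    ("factor_A11", ["A11"], ["A11"]),
]

deps = build_task_graph(tasks)
waves = schedule_waves(deps)
for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

In this sketch the two `update_A01` and `update_A10` tasks land in the same wave because neither touches a tile the other writes. A real scheduler would dispatch ready tasks to worker threads as soon as their inputs are available rather than waiting for whole waves to finish.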