An improved parallel singular value algorithm and its implementation for multicore hardware

Authors:
Azzam Haidar;Jakub Kurzak;Piotr Luszczek
Affiliations:
University of Tennessee, Knoxville, Tennessee;University of Tennessee, Knoxville, Tennessee;University of Tennessee, Knoxville, Tennessee
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges, starting with our algorithm design, continuing through kernel optimization and tuning, and finishing with our programming model. Together, these lead to the development of a scalable, high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grained, cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
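
The three request types compared in the abstract (all singular vectors, a fraction of the vectors, singular values only) can be illustrated with a minimal NumPy sketch. This is not the authors' solver or its interface: the function name `svd_workloads`, the `fraction` parameter, and the test matrix are assumptions made for illustration, and NumPy's `numpy.linalg.svd` stands in for a generic SVD routine.

```python
import numpy as np


def svd_workloads(a, fraction=0.2):
    """Illustrate the three SVD request types (hypothetical helper)."""
    # 1) Singular values only: no singular vectors are formed.
    values_only = np.linalg.svd(a, compute_uv=False)

    # 2) All singular vectors, in thin (economy) form.
    u, s, vt = np.linalg.svd(a, full_matrices=False)

    # 3) A fraction of the vectors. numpy.linalg.svd has no
    #    partial-vector option, so this sketch slices the full
    #    result; unlike a dedicated solver, it saves no work.
    k = max(1, int(round(fraction * s.size)))
    partial = (u[:, :k], s[:k], vt[:k, :])

    return values_only, (u, s, vt), partial


# Assumed test input: a random 50 x 30 matrix.
rng = np.random.default_rng(0)
a = rng.standard_normal((50, 30))

values_only, (u, s, vt), (uk, sk, vtk) = svd_workloads(a)

# Both paths should agree on the singular values, and the thin
# factors should reconstruct the input matrix.
print(np.allclose(values_only, s))
print(np.allclose((u * s) @ vt, a))
print(uk.shape, sk.shape, vtk.shape)
```

The sketch only shows what each request returns. The speed-ups quoted in the abstract depend on a solver that actually skips the work for vectors that are not requested, which slicing a full decomposition does not do.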