Accurate singular values of bidiagonal matrices
SIAM Journal on Scientific and Statistical Computing
The bidiagonal singular value decomposition and Hamiltonian mechanics
SIAM Journal on Numerical Analysis
A Parallel Algorithm for Computing the Singular Value Decomposition of a Matrix
SIAM Journal on Matrix Analysis and Applications
A Divide-and-Conquer Algorithm for the Bidiagonal SVD
SIAM Journal on Matrix Analysis and Applications
Matrix computations (3rd ed.)
ScaLAPACK Users' Guide
Unitary Triangularization of a Nonsymmetric Matrix
Journal of the ACM (JACM)
LAPACK Users' guide (third ed.)
Efficient parallel reduction to bidiagonal form
Parallel Computing
Algorithm 807: The SBR Toolbox—software for successive band reduction
ACM Transactions on Mathematical Software (TOMS)
The Decompositional Approach to Matrix Computation
Computing in Science and Engineering
Evaluating Block Algorithm Variants in LAPACK
Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Design and Evaluation of Parallel Block Algorithms: LU Factorization on an IBM 3090 VF/600J
Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing
New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms
Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
Automatic blocking of QR and LU factorizations for locality
MSP '04 Proceedings of the 2004 workshop on Memory system performance
Block and Parallel Versions of One-Sided Bidiagonalization
SIAM Journal on Matrix Analysis and Applications
Parallel tiled QR factorization for multicore architectures
Concurrency and Computation: Practice & Experience
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Comparative study of one-sided factorizations with multiple software packages on multi-core hardware
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures
IEEE Transactions on Parallel and Distributed Systems
The impact of multicore on math software
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Computer Architecture, Fifth Edition: A Quantitative Approach
IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Concurrency and Computation: Practice & Experience
Computer Science - Research and Development
Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
Proceedings of the 27th international ACM conference on International conference on supercomputing
An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. It extends the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation in the first stage (reduction to band form) into calls to Level 3 BLAS and reduces memory traffic in the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout onto the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that data dependencies are not violated. A detailed analysis explains the critical impact on total execution time of the tile size, which also determines the matrix bandwidth after the first-stage reduction. The performance results show a significant improvement over currently established alternatives: the new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix over state-of-the-art open-source and commercial numerical software packages, namely LAPACK (compiled with optimized, multithreaded BLAS from MKL) and Intel MKL version 10.2.
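The tile data layout of feature (1) — storing the matrix as small contiguous square blocks rather than in column-major order — can be illustrated with a minimal sketch. This is not the authors' implementation (which operates in compiled kernels scheduled by a runtime system); the function names and the use of NumPy here are illustrative assumptions only.

```python
import numpy as np

def to_tile_layout(A, nb):
    """Copy a matrix into contiguous nb-by-nb tiles.

    Illustrative sketch only: returns a dict mapping tile coordinates
    (ti, tj) to contiguous tile buffers. Border tiles may be ragged
    when nb does not divide the matrix dimensions.
    """
    m, n = A.shape
    tiles = {}
    for i in range(0, m, nb):
        for j in range(0, n, nb):
            # each tile is stored contiguously, which is what improves
            # cache reuse for the blocked kernels operating on it
            tiles[(i // nb, j // nb)] = np.ascontiguousarray(A[i:i + nb, j:j + nb])
    return tiles

def from_tile_layout(tiles, m, n, nb):
    """Reassemble the full matrix from its tiles (inverse mapping)."""
    A = np.empty((m, n))
    for (ti, tj), tile in tiles.items():
        i, j = ti * nb, tj * nb
        A[i:i + tile.shape[0], j:j + tile.shape[1]] = tile
    return A

# round-trip check: 6x6 matrix, 4x4 tiles (exercises ragged border tiles)
A = np.arange(36.0).reshape(6, 6)
tiles = to_tile_layout(A, 4)
assert np.array_equal(from_tile_layout(tiles, 6, 6, 4), A)
```

The translation layer of feature (3) plays the role of the two mapping functions above: it lets an algorithm expressed against column-major storage execute on data held in tile layout. The tile size nb is the tuning parameter the article's analysis focuses on, since it also fixes the bandwidth of the intermediate band form.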