Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction

Authors:
Hatem Ltaief;Piotr Luszczek;Jack Dongarra
Affiliations:
KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia;University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN, USA, Oak Ridge National Laboratory, USA, University of Manchester, United Kingdom
Venue:
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Year:
2011

References (10):
A comparison between some direct and iterative methods for certain large scale geodetic least squares problems. SIAM Journal on Scientific and Statistical Computing
Matrix computations (3rd ed.)
Unitary Triangularization of a Nonsymmetric Matrix. Journal of the ACM (JACM)
LAPACK Users' guide (third ed.)
Algorithm 807: The SBR Toolbox—software for successive band reduction. ACM Transactions on Mathematical Software (TOMS)
The Decompositional Approach to Matrix Computation. Computing in Science and Engineering
Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures. PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Communication-Avoiding QR Decomposition for GPUs. IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures. IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium

Cited by (1):
An improved parallel singular value algorithm and its implementation for multicore hardware. SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Abstract

The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et al. [LAPACK Working Note #247, 2011], the bidiagonal transformation using tile algorithms with a two-stage approach has shown very promising results on square matrices. However, for tall and skinny matrices, processing the panel in a domino-like fashion generates unnecessary sequential tasks. With tree reduction, the panel is split horizontally, which creates another dimension of parallelism and yields many concurrent tasks that can be dynamically scheduled on the available cores. The results reported in this paper are very encouraging. The new tile bidiagonal transformation, targeting tall and skinny matrices, outperforms the state-of-the-art numerical linear algebra libraries LAPACK V3.2 and Intel MKL ver. 10.3 by up to 29-fold and the standard two-stage PLASMA BRD by up to 20-fold on an eight-socket hexa-core AMD Opteron multicore shared-memory system.
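
The tree-reduction idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it shows only the QR step on a tall and skinny panel, uses modified Gram-Schmidt as a simple stand-in for Householder-based tile kernels, and assumes every row block has at least as many rows as columns and full column rank. The names `r_factor` and `tree_reduce_r` are hypothetical.

```python
import math


def r_factor(rows):
    """Upper-triangular R of a QR factorization of `rows` (m x n, m >= n).

    Uses modified Gram-Schmidt and assumes full column rank.
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def tree_reduce_r(a, block_rows):
    """Split the panel `a` into row blocks and merge their R factors pairwise.

    Returns the final R and the number of merge levels. All factorizations
    within one level are independent of each other, so a scheduler could run
    them concurrently; this sketch runs them one after another.
    """
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    level = [r_factor(b) for b in blocks]  # leaves: one R per row block
    depth = 0
    while len(level) > 1:
        # Stack neighbouring R factors and factor each stacked pair again.
        merged = [r_factor(level[i] + level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired R moves up to the next level
            merged.append(level[-1])
        level = merged
        depth += 1
    return level[0], depth


# Hypothetical 8 x 2 panel, split into four 2 x 2 row blocks.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0],
     [2.0, 1.0], [0.0, 3.0], [4.0, 1.0], [1.0, 1.0]]
R, depth = tree_reduce_r(A, block_rows=2)
print(depth)
print(R)
```

With four row blocks the tree needs two merge levels, whereas folding one block at a time into a running R (the domino-like order) would need three merges in sequence. The gap grows with the number of blocks: logarithmic depth for the tree against linear depth for the chain.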