Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures

  • Authors:
  • Piotr Luszczek; Hatem Ltaief; Jack Dongarra

  • Venue:
  • IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
  • Year:
  • 2011

Abstract

While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architectures, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has further exacerbated the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout, i.e., a high-performance data representation, which substantially enhances data locality. Following a two-stage approach, the tile matrix is then transformed into band tridiagonal form using compute-intensive kernels. The band form is further reduced to the required tridiagonal form using a left-looking bulge chasing technique to reduce memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The resulting tile tridiagonal reduction significantly outperforms state-of-the-art numerical libraries (10X against multithreaded LAPACK with optimized MKL BLAS and 2.5X against the commercial Intel MKL numerical software) for medium to large matrix sizes.
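
To make the two-stage idea concrete, below is a minimal NumPy sketch of the first stage only: reducing a dense symmetric matrix to symmetric band form via successive two-sided block Householder (QR) transformations. It is a simplification under stated assumptions, not the authors' implementation: it uses plain full storage rather than the tile layout, omits the second-stage left-looking bulge chasing and the dynamic runtime scheduling, and the helper name sym_to_band is hypothetical.

```python
import numpy as np

def sym_to_band(A, nb):
    """Stage-1 sketch (hypothetical helper): reduce a dense symmetric matrix
    to symmetric band form with semi-bandwidth nb using successive two-sided
    block Householder (QR) transformations in full storage."""
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n, nb):
        m = n - (j + nb)                  # rows below the band in this panel
        if m <= 1:
            break                         # nothing left to annihilate
        # QR-factor the panel sitting below the band of columns j .. j+nb-1
        Q, _ = np.linalg.qr(A[j+nb:, j:j+nb], mode='complete')   # Q is m x m
        # Two-sided similarity update A <- U^T A U with U = diag(I, Q)
        A[j+nb:, j:] = Q.T @ A[j+nb:, j:]     # apply Q^T from the left
        A[:, j+nb:] = A[:, j+nb:] @ Q         # apply Q from the right
    return A

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, nb = 12, 3
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                         # make the test matrix symmetric
    B = sym_to_band(A, nb)
    # Entries outside the band are numerically zero ...
    mask = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > nb
    print("max off-band magnitude:", np.abs(B[mask]).max())
    # ... and the spectrum is preserved, since only orthogonal similarity
    # transformations were applied.
    print("max eigenvalue error  :",
          np.abs(np.linalg.eigvalsh(A) - np.linalg.eigvalsh(B)).max())
```

Because only orthogonal similarity transformations are applied, the band matrix (and, after the second-stage bulge chasing, the tridiagonal matrix) has the same eigenvalues as the original, which is what the check at the end verifies.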