Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels

Authors:
Azzam Haidar;Hatem Ltaief;Jack Dongarra
Affiliations:
University of Tennessee, Knoxville, TN;KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia;University of Tennessee, Knoxville, TN
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Citing 19
Cited 4

Solution of large, dense symmetric generalized eigenvalue problems using secondary storage

ACM Transactions on Mathematical Software (TOMS)
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing

PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
ScaLAPACK user's guide

ScaLAPACK user's guide
LAPACK Users' guide (third ed.)

LAPACK Users' guide (third ed.)
Banded Eigenvalue Solvers on Vector Machines

ACM Transactions on Mathematical Software (TOMS)
Algorithm 807: The SBR Toolbox—software for successive band reduction

ACM Transactions on Mathematical Software (TOMS)
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures

PDP '08 Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
Updating an LU Factorization with Pivoting

ACM Transactions on Mathematical Software (TOMS)
Parallel tiled QR factorization for multicore architectures

Concurrency and Computation: Practice & Experience
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines

Scientific Programming
A class of parallel tiled linear algebra algorithms for multicore architectures

Parallel Computing
Comparative study of one-sided factorizations with multiple software packages on multi-core hardware

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
The libflame Library for Dense Matrix Computations

IEEE Design & Test
Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures

IEEE Transactions on Parallel and Distributed Systems
The impact of multicore on math software

PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
The International Exascale Software Project roadmap

International Journal of High Performance Computing Applications
Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA

IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices Using Tile Algorithms on Multicore Architectures

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium

Communication avoiding successive band reduction

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Efficient generalized Hessenberg form and applications

ACM Transactions on Mathematical Software (TOMS)
Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication

Proceedings of the 27th international ACM conference on International conference on supercomputing
An improved parallel singular value algorithm and its implementation for multicore hardware

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvement (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 x 24000.