Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures

Authors:
Azzam Haidar;Hatem Ltaief;Asim YarKhan;Jack Dongarra
Affiliations:
Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA;Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA;Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA;Dept. of Electrical Eng. and Comp. Sci., Univ. of Tennessee, Knoxville, TN, USA and Comp. Sci. and Mathematics Division, Oak Ridge Natnl. Lab., Oak Ridge, Tennessee and Sch. of Mathematics & Sch. o ...
Venue:
Concurrency and Computation: Practice & Experience
Year:
2011

Abstract

The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared-memory multicore architectures. Current numerical libraries (e.g., LAPACK, the Linear Algebra PACKage) show clear limitations on such emerging systems, mainly because of their coarse-grained tasks. Many numerical algorithms therefore need to be redesigned to better fit the architecture of multicore platforms. The Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library, developed at the University of Tennessee, tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can be represented as directed acyclic graphs (DAGs), where nodes are tasks and edges are the dependencies between them. The key to achieving high performance is a runtime environment that efficiently schedules the execution of the DAG across the multicore platform. This paper studies the impact on overall performance of several parameters, both at the level of the scheduler (e.g., window size and locality) and of the algorithms (e.g., left-looking and right-looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.
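
To make the DAG view in the abstract concrete, the following is a minimal sketch of how a tile algorithm can be unrolled into tasks and how dependency edges can be inferred from the tiles each task reads and writes. It uses a right-looking tile Cholesky factorization as the example, which is an assumption made here for illustration; the abstract does not say which factorizations the paper studies. The function names (`tile_cholesky_tasks`, `build_dag`, `critical_path_length`) are hypothetical and are not part of PLASMA or any other library. Only the kernel labels (POTRF, TRSM, SYRK, GEMM) correspond to standard LAPACK/BLAS routine names, and no numerical work is performed.

```python
from collections import defaultdict


def tile_cholesky_tasks(nt):
    """List (name, reads, writes) for a right-looking tile Cholesky
    on an nt x nt grid of tiles. Tiles are identified by (row, col).
    'writes' means read-modify-write (in-out) access."""
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", set(), {(k, k)}))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", {(k, k)}, {(i, k)}))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", {(i, k)}, {(i, i)}))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", {(i, k), (j, k)}, {(i, j)}))
    return tasks


def build_dag(tasks):
    """Infer predecessor sets from data hazards, scanning tasks in
    program order (hypothetical helper, not a library API)."""
    last_writer = {}              # tile -> index of last task that wrote it
    readers = defaultdict(list)   # tile -> readers since the last write
    preds = defaultdict(set)      # task index -> predecessor task indices
    for t, (_, reads, writes) in enumerate(tasks):
        # Read-after-write and write-after-write hazards.
        for tile in reads | writes:
            if tile in last_writer:
                preds[t].add(last_writer[tile])
        # Write-after-read hazards, then record this task as the writer.
        for tile in writes:
            preds[t].update(readers[tile])
            readers[tile] = []
            last_writer[tile] = t
        for tile in reads:
            readers[tile].append(t)
        preds[t].discard(t)
    return preds


def critical_path_length(tasks, preds):
    """Longest chain of dependent tasks, counted in tasks. Program order
    is already a topological order because edges point to earlier tasks."""
    depth = []
    for t in range(len(tasks)):
        depth.append(1 + max((depth[p] for p in preds[t]), default=0))
    return max(depth, default=0)


if __name__ == "__main__":
    tasks = tile_cholesky_tasks(3)
    preds = build_dag(tasks)
    for t, (name, _, _) in enumerate(tasks):
        print(name, "<-", sorted(tasks[p][0] for p in preds[t]))
    print("tasks:", len(tasks), "critical path:", critical_path_length(tasks, preds))
```

For a 3 x 3 grid of tiles this sketch produces 10 tasks with a longest dependency chain of 7 tasks. A dynamic scheduler of the kind the abstract describes would run any task whose predecessors have all completed; how many tasks it unrolls ahead of execution (the window size) and how it assigns tasks to cores (locality) are the scheduler-level parameters the paper examines, and they are not modeled here.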