The WY representation for products of householder matrices
SIAM Journal on Scientific and Statistical Computing - Papers from the Second Conference on Parallel Processing for Scientific Computin
A storage-efficient WY representation for products of householder transformations
SIAM Journal on Scientific and Statistical Computing
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
LAPACK's user's guide
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Applied numerical linear algebra
Applied numerical linear algebra
ScaLAPACK user's guide
Basic Linear Algebra Subprograms for Fortran Usage
ACM Transactions on Mathematical Software (TOMS)
Numerical Linear Algebra for High Performance Computers
Numerical Linear Algebra for High Performance Computers
Parallel out-of-core computation and updating of the QR factorization
ACM Transactions on Mathematical Software (TOMS)
Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Updating an LU Factorization with Pivoting
ACM Transactions on Mathematical Software (TOMS)
Parallel tiled QR factorization for multicore architectures
Concurrency and Computation: Practice & Experience
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization
IEEE Transactions on Parallel and Distributed Systems
QR factorization for the Cell Broadband Engine
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Comparative study of one-sided factorizations with multiple software packages on multi-core hardware
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
The libflame Library for Dense Matrix Computations
Computing in Science and Engineering
High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures
ACM Transactions on Mathematical Software (TOMS)
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
An improved parallel singular value algorithm and its implementation for multicore hardware
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared-memory, multicore architectures. Current numerical libraries (e.g., linear algebra package) show clear limitations on such emerging systems mainly because of their coarse granularity tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The parallel linear algebra for scalable multicore architectures library developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can then be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between the tasks. The paramount key to achieve high performance is to implement a runtime environment to efficiently schedule the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on the overall performance of some parameters, both at the level of the scheduler (e.g., window size and locality) and the algorithms (e.g., left-looking and right-looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.