Programming matrix algorithms-by-blocks for thread-level parallelism

Authors:
Gregorio Quintana-Ortí;Enrique S. Quintana-Ortí;Robert A. Van De Geijn;Field G. Van Zee;Ernie Chan
Affiliations:
Universidad Jaume I, Castellón, Spain;Universidad Jaume I, Castellón, Spain;The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2009

Abstract

With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
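The dependency analysis that the abstract attributes to the runtime can be illustrated with a small sketch. The code below is not the FLASH or SuperMatrix API; the names `Task`, `cholesky_tasks`, `dependencies`, and `waves` are hypothetical, and the Cholesky factorization is chosen only as a familiar example of an algorithm-by-blocks. Each suboperation declares which blocks it reads and writes, and that information alone is enough to derive which suboperations may run concurrently.

```python
from collections import namedtuple

# A suboperation on blocks: its name plus the (row, col) blocks it reads and writes.
# Hypothetical structure for illustration; not the SuperMatrix data layout.
Task = namedtuple("Task", ["name", "reads", "writes"])


def cholesky_tasks(n):
    """List, in sequential program order, the block operations of a
    right-looking Cholesky factorization on an n x n grid of blocks."""
    tasks = []
    for k in range(n):
        tasks.append(Task(f"CHOL({k},{k})", {(k, k)}, {(k, k)}))
        for i in range(k + 1, n):
            tasks.append(Task(f"TRSM({i},{k})", {(k, k), (i, k)}, {(i, k)}))
        for i in range(k + 1, n):
            for j in range(k + 1, i):
                tasks.append(
                    Task(f"GEMM({i},{j};{k})", {(i, k), (j, k), (i, j)}, {(i, j)})
                )
            tasks.append(Task(f"SYRK({i},{i};{k})", {(i, k), (i, i)}, {(i, i)}))
    return tasks


def dependencies(tasks):
    """For each task index, return the set of earlier task indices it must
    wait for, derived only from the declared read and write sets."""
    last_writer = {}          # block -> index of the most recent task writing it
    readers_since_write = {}  # block -> tasks that read it since that write
    deps = []
    for t, task in enumerate(tasks):
        d = set()
        for b in task.reads:
            if b in last_writer:
                d.add(last_writer[b])                 # read-after-write
        for b in task.writes:
            if b in last_writer:
                d.add(last_writer[b])                 # write-after-write
            d.update(readers_since_write.get(b, ()))  # write-after-read
        deps.append(d)
        for b in task.reads:
            readers_since_write.setdefault(b, set()).add(t)
        for b in task.writes:
            last_writer[b] = t
            readers_since_write[b] = set()
    return deps


def waves(tasks, deps):
    """Group task names into waves; tasks in the same wave have no mutual
    dependencies and could be assigned to different threads."""
    level = []
    for t in range(len(tasks)):
        level.append(1 + max((level[d] for d in deps[t]), default=-1))
    out = [[] for _ in range(max(level) + 1)]
    for t, lvl in enumerate(level):
        out[lvl].append(tasks[t].name)
    return out


if __name__ == "__main__":
    ts = cholesky_tasks(3)
    for w in waves(ts, dependencies(ts)):
        print(w)
```

On a 3 x 3 grid of blocks, the two triangular solves that follow the first diagonal factorization fall into the same wave, which is the kind of parallelism an out-of-order scheduler can exploit. A real runtime would dispatch tasks as soon as their dependencies are satisfied rather than in fixed waves; the wave grouping here is a simplification for readability.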