Scaling LAPACK panel operations using parallel cache assignment
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
In LAPACK, many matrix operations are cast as block algorithms that iteratively factor a panel using an unblocked algorithm and then update the remaining (trailing) matrix using the high-performance Level 3 BLAS. The Level 3 BLAS scale well, but panel processing tends to be bus bound, and thus scales with bus speed rather than with the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation becomes the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach to panel factorization, which we show scales well with p. We apply this general approach to the QR, QL, RQ, LQ, and LU panel factorizations, and present results for two commodity platforms: an 8-core Intel platform and a 32-core AMD platform. For both platforms and all twenty implementations (five factorizations, each available in the four LAPACK data types), our results demonstrate that our approach yields significant speedup over the existing state of the art.
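The block structure described above (an unblocked panel factorization followed by a Level 3 BLAS trailing-matrix update) can be sketched for LU as follows. This is a minimal NumPy illustration of the generic blocked pattern, not the paper's implementation: it omits partial pivoting (so it assumes an input, e.g. a diagonally dominant matrix, for which pivot-free LU is stable), and the function names and block size are ours.

```python
import numpy as np

def lu_unblocked(A):
    """Unblocked (panel) LU without pivoting, in place.

    This Level 1/Level 2 style loop is the bus-bound step whose
    parallelization the paper targets.
    """
    m, n = A.shape
    for k in range(min(m, n)):
        A[k+1:, k] /= A[k, k]                              # scale column below the pivot
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # rank-1 trailing update

def lu_blocked(A, nb=4):
    """Right-looking blocked LU without pivoting, in place.

    Each iteration factors a tall panel, then updates the trailing
    matrix with Level 3 BLAS-style operations (TRSM + GEMM), which
    is the part that scales with the number of processors.
    """
    m, n = A.shape
    for k in range(0, min(m, n), nb):
        kb = min(nb, min(m, n) - k)
        lu_unblocked(A[k:, k:k+kb])                        # panel factorization
        # Unit lower-triangular L11 from the just-factored diagonal block.
        L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
        # TRSM-like step: solve L11 * U12 = A12 for the block row U12.
        A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
        # GEMM step: trailing update A22 -= L21 * U12.
        A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
```

After `lu_blocked(A)` returns, the strictly lower triangle of `A` holds L (with an implicit unit diagonal) and the upper triangle holds U, mirroring LAPACK's in-place storage convention.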