Scalable parallelization of FLAME code via the workqueuing model

Authors:
Field G. Van Zee;Paolo Bientinesi;Tze Meng Low;Robert A. van de Geijn
Affiliations:
The University of Texas at Austin, Austin, TX;Duke University, Durham, NC;The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2008

Abstract

We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands, much as a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
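
The ideas in the abstract can be illustrated with a small sketch. It is not the authors' FLAME or OpenMP code: it is a plain Python stand-in that uses a thread pool as the work queue, and every name in it (syrk_tasks, update_block, parallel_syrk, the block size nb) is hypothetical. It partitions the lower triangle of C in two dimensions for the symmetric rank-k update C := C + A * A^T, estimates each block's cost as the number of entries it updates, and enqueues the most expensive tasks first. The abstract does not state the sort direction, so most-expensive-first is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor


def syrk_tasks(n, nb):
    """List one task per nb x nb block in the lower triangle of an n x n C.

    Each task is (cost, i0, i1, j0, j1). The cost is the number of entries
    of C the task updates (a proxy for its work; the factor k is common to
    all tasks and is left out).
    """
    tasks = []
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        for j0 in range(0, i0 + 1, nb):
            j1 = min(j0 + nb, n)
            if i0 == j0:
                m = i1 - i0
                cost = m * (m + 1) // 2      # diagonal block: triangle only
            else:
                cost = (i1 - i0) * (j1 - j0)  # off-diagonal block: full block
            tasks.append((cost, i0, i1, j0, j1))
    return tasks


def update_block(C, A, i0, i1, j0, j1):
    """Update C[i][j] += dot(A[i], A[j]) for block entries with j <= i."""
    k = len(A[0])
    for i in range(i0, i1):
        for j in range(j0, min(j1, i + 1)):
            C[i][j] += sum(A[i][p] * A[j][p] for p in range(k))


def parallel_syrk(C, A, nb, workers=4):
    """Lower-triangular C := C + A * A^T using a sorted queue of block tasks."""
    tasks = syrk_tasks(len(C), nb)
    tasks.sort(reverse=True)  # assumption: most expensive tasks are queued first
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The pool's internal FIFO queue plays the role of the work queue.
        futures = [pool.submit(update_block, C, A, *t[1:]) for t in tasks]
        for f in futures:
            f.result()  # re-raise any exception from a worker


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    C = [[0.0] * 3 for _ in range(3)]
    parallel_syrk(C, A, nb=2)
    for row in C:
        print(row)
```

Each task writes a disjoint block of C, so the workers need no locking. Because of Python's global interpreter lock the sketch shows the task structure only and should not be expected to speed up; in the setting the abstract describes, each task would instead be handed to an OpenMP thread.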