Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modeling

Authors:
Piotr Luszczek; Jack Dongarra
Affiliations:
University of Tennessee, Knoxville, TN;University of Tennessee, Knoxville, TN, USA, Oak Ridge National Laboratory, USA, University of Manchester, United Kingdom
Venue:
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Year:
2011

Abstract

We present a modeling framework that accurately predicts the time to run dense linear algebra calculations. We report the framework's accuracy in a range of computational environments: shared-memory multicore systems, clusters, and large supercomputing installations with tens of thousands of cores. We also test its accuracy for several algorithms, each of which has different scaling properties and a different tolerance to low-bandwidth, high-latency interconnects. The predictive accuracy is very good, on the order of the measurement accuracy, which makes the method suitable for both dedicated and non-dedicated environments. We also present a practical application of our model: reducing the time required to tune and optimize large parallel runs whose time is dominated by linear algebra computations. We give practical examples of how to apply the methodology to avoid common pitfalls and to reduce the influence of measurement errors and inherent performance variability.