The impact of data distribution in accuracy and performance of parallel linear algebra subroutines

Authors:
Björn Rocker;Mariana Kolberg;Vincent Heuveline
Affiliations:
Karlsruhe Institute of Technology, Engineering Mathematics and Computing Lab, Karlsruhe, Germany;Universidade Luterana do Brasil, Canoas, RS, Brasil;Karlsruhe Institute of Technology, Engineering Mathematics and Computing Lab, Karlsruhe, Germany
Venue:
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Year:
2010

Abstract

In parallel computing, the data distribution may have a significant impact on application performance and accuracy. These effects can be observed using the parallel matrix-vector multiplication routine from PBLAS with different grid configurations for the data distribution. Matrix-vector multiplication is an especially important operation since it is widely used in numerical simulation (e.g., in iterative solvers for linear systems of equations). This paper presents the mathematical background of error propagation in elementary operations and proposes benchmarks to show how different grid configurations based on the two-dimensional block-cyclic distribution impact the accuracy and performance of parallel matrix-vector operations. The experimental results validate the theoretical findings.
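
The accuracy effect described in the abstract can be illustrated with a small sequential sketch. It is not the paper's benchmark and it does not call PBLAS: the helper names `owner` and `simulated_matvec`, the block sizes, and the reduction order (each process column sums its own columns, then the partial sums are added in order) are all assumptions made for illustration. The point is only that changing the number of process columns changes the order of floating-point additions, and therefore the rounded result.

```python
import math


def owner(i, j, mb, nb, p_rows, q_cols):
    """Process coordinates that own entry (i, j) in a 2D block-cyclic layout.

    mb x nb is the block size, p_rows x q_cols the process grid
    (hypothetical parameter names, zero-based indices).
    """
    return (i // mb) % p_rows, (j // nb) % q_cols


def simulated_matvec(a, x, nb, q_cols):
    """Compute y = A x the way a grid with q_cols process columns might.

    Assumption: every process column accumulates the products of the
    matrix columns it owns, and the partial sums are then added in
    process-column order. Rows are independent in a matrix-vector
    product, so the number of process rows does not change the
    summation order in this simplified model.
    """
    y = []
    for row in a:
        partial = [0.0] * q_cols
        for j, (a_ij, x_j) in enumerate(zip(row, x)):
            partial[(j // nb) % q_cols] += a_ij * x_j
        total = 0.0
        for s in partial:
            total += s
        y.append(total)
    return y


# A single row chosen so that cancellation makes the ordering visible.
a = [[1e16, 1.0, -1e16, 1.0]]
x = [1.0, 1.0, 1.0, 1.0]

# Correctly rounded reference value using math.fsum.
exact = [math.fsum(v * w for v, w in zip(row, x)) for row in a]

y_q1 = simulated_matvec(a, x, nb=1, q_cols=1)  # one process column
y_q2 = simulated_matvec(a, x, nb=1, q_cols=2)  # two process columns

print("reference:", exact)
print("q_cols=1 :", y_q1)
print("q_cols=2 :", y_q2)
```

With one process column the 1.0 added to 1e16 is lost to rounding before the cancellation happens; with two process columns the large terms cancel inside one partial sum and the small terms are added separately, so the two configurations return different values for the same input. A real PBLAS run may reduce partial results in a different order, so this should be read as a model of the mechanism rather than a prediction of library output.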