An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization

Authors:
Toshiyuki Imamura
Affiliations:
Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, 2-2-54 Nakameguro, Meguro-ku, Tokyo 153, Japan
Email: imamura@koma.jaeri.go.jp
Venue:
The Journal of Supercomputing
Year:
2000


Abstract

The Vertical Block-cyclic Distributed Parallel LU Factorization Method (VBPLU) runs efficiently on distributed-memory parallel computers. VBPLU is based on two techniques: the block algorithm and the aggregation of communications. Because startup time dominates the cost of data communication, and aggregation reduces the number of communications issued, overall performance is much improved. Furthermore, the method uses long vectors, so it is also advantageous on vector processors. In this paper, we construct a model of VBPLU using a simplified LogGP model with analytical formulae, and accurately estimate the computational cost, taking into account the load distribution caused by data layout and process mapping. We also obtain some insights into optimizing the block algorithm. Our estimates have been verified through numerical experiments on three different distributed-memory parallel computers.
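
The two ideas the abstract relies on, startup-dominated communication and load distribution under a block-cyclic layout, can be illustrated with a minimal sketch. This is not the paper's model: the two-parameter cost (a fixed startup plus a per-byte term) is a simplifying assumption standing in for the simplified LogGP formulae, "vertical" is assumed here to mean that blocks of columns are dealt out cyclically to processes, and every function and parameter name is hypothetical.

```python
def message_time(nbytes, startup, per_byte):
    """Cost of one message: fixed startup plus a per-byte term (assumed model)."""
    return startup + nbytes * per_byte


def separate_time(sizes, startup, per_byte):
    """Total cost when each payload is sent as its own message."""
    return sum(message_time(n, startup, per_byte) for n in sizes)


def aggregated_time(sizes, startup, per_byte):
    """Total cost when all payloads are packed into a single message."""
    return message_time(sum(sizes), startup, per_byte)


def column_owner(j, block_width, num_procs):
    """Process that owns global column j under a column block-cyclic layout."""
    return (j // block_width) % num_procs


def columns_per_process(n, block_width, num_procs):
    """Number of the n matrix columns held by each process."""
    counts = [0] * num_procs
    for j in range(n):
        counts[column_owner(j, block_width, num_procs)] += 1
    return counts


if __name__ == "__main__":
    sizes = [8] * 10  # ten small payloads, in bytes
    print(separate_time(sizes, startup=100, per_byte=1))
    print(aggregated_time(sizes, startup=100, per_byte=1))
    print(columns_per_process(10, block_width=2, num_procs=3))
```

Under this assumed model, packing m messages into one saves exactly (m - 1) startups, which is why aggregation pays off when startup time dominates. The column counts show how the block width and process count can leave some processes with fewer columns than others, the kind of load imbalance a cost estimate has to account for.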