Algorithmic Redistribution Methods for Block-Cyclic Decompositions

Authors:
Antoine P. Petitet;Jack J. Dongarra
Affiliations:
Univ. of Tennessee, Knoxville;Univ. of Tennessee, Knoxville
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1999

Abstract

This article presents various data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are induced on the operands, and optimal values with respect to performance are architecture-dependent. The techniques presented in this article redistribute data "on the fly," so that the user's data distribution blocking factor becomes independent of the architecture-dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis along with experimental results shows that alignment restrictions can then be removed and that high performance can be maintained across platforms independently of the user's data distribution blocking factor.
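
To make the role of the distribution blocking factor concrete, the following minimal sketch maps a global index to its owning process under a one-dimensional block-cyclic distribution, then counts how many entries change owner when the blocking factor changes. It is an illustration only and not the article's method: the function names `owner_and_local_index` and `entries_changing_owner` are hypothetical, indices are assumed 0-based, and the first block is assumed to reside on process 0. The article itself concerns two-dimensional distributions and redistribution performed within the matrix-matrix multiplication.

```python
def owner_and_local_index(g, nb, nprocs):
    """Return (process, local index) of global index g (0-based) under a
    1D block-cyclic distribution with blocking factor nb over nprocs
    processes. Assumption: block 0 is stored on process 0."""
    block = g // nb                # global block that contains g
    proc = block % nprocs          # blocks are dealt out round-robin
    local_block = block // nprocs  # position of that block on its owner
    return proc, local_block * nb + g % nb


def entries_changing_owner(n, nb_src, nb_dst, nprocs):
    """Count the entries of a length-n vector whose owning process differs
    between blocking factors nb_src and nb_dst (hypothetical helper)."""
    return sum(
        1
        for g in range(n)
        if owner_and_local_index(g, nb_src, nprocs)[0]
        != owner_and_local_index(g, nb_dst, nprocs)[0]
    )


if __name__ == "__main__":
    # Eight entries on two processes: blocking factor 2 versus 4.
    for g in range(8):
        print(g, owner_and_local_index(g, 2, 2), owner_and_local_index(g, 4, 2))
    print(entries_changing_owner(8, 2, 4, 2))
```

With eight entries on two processes, changing the blocking factor from 2 to 4 relocates half of the entries, which suggests why tying the algorithmic partitioning unit to the user's distribution blocking factor constrains operand alignment.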