This article presents several data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are imposed on the operands, and the values that give the best performance are architecture-dependent. The techniques presented in this article redistribute data "on the fly," so that the user's data distribution blocking factor becomes independent of the architecture-dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis, together with experimental results, shows that the alignment restrictions can then be removed and that high performance can be maintained across platforms independently of the user's data distribution blocking factor.
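To make the role of the blocking factor concrete, the following is a minimal sketch of a one-dimensional block-cyclic mapping and of the data that would have to move if the blocking factor changed. It is an illustration under stated assumptions, not the article's redistribution algorithm: the function names `owner_and_local` and `redistribution_sets`, the 0-based indexing, and the restriction to one dimension are all hypothetical choices made for this example.

```python
def owner_and_local(g, nb, p, src=0):
    """Map a 0-based global index g to (process, local index) under a
    1-D block-cyclic distribution with blocking factor nb over p
    processes, where process `src` owns the first block.
    (Hypothetical helper, for illustration only.)"""
    block = g // nb                # global block containing g
    proc = (src + block) % p       # blocks are dealt out round-robin
    local_block = block // p       # blocks this process holds before it
    return proc, local_block * nb + g % nb


def redistribution_sets(n, nb_from, nb_to, p):
    """List, per (source, destination) process pair, the global indices
    of an n-element dimension that change owner when the blocking
    factor goes from nb_from to nb_to. A naive O(n) enumeration,
    assumed here only to show what must be communicated."""
    moves = {}
    for g in range(n):
        s, _ = owner_and_local(g, nb_from, p)
        d, _ = owner_and_local(g, nb_to, p)
        if s != d:
            moves.setdefault((s, d), []).append(g)
    return moves


if __name__ == "__main__":
    # 8 elements, 2 processes: user blocking factor 2, algorithmic unit 4.
    print(redistribution_sets(8, 2, 4, 2))
```

In this small case, half of the elements change owner when the blocking factor goes from 2 to 4, which suggests why decoupling the user's distribution from the algorithmic partitioning requires moving data during the computation.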