Computationally efficient parallel matrix-matrix multiplication on the torus

Authors:
Ahmed S. Zekri;Stanislav G. Sedukhin
Affiliations:
Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Fukushima, Japan;Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Fukushima, Japan
Venue:
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Year:
2005

Abstract

In this paper, we represent the computation space of the (n×n)-matrix multiplication problem C=C+A·B as a 3D torus. We determine all possible time-minimal scheduling vectors needed to activate the computations inside the corresponding 3D index points at each computation step. Using the projection method to allocate the scheduled computations to processing elements, the resulting array processor that minimizes the computing time is a 2D torus with n×n processing elements. For each optimal time scheduling function, three optimal array allocations are obtained by projection. The allocations resulting from all the optimal scheduling vectors can be classified into three groups. In the first group, matrix C remains in place and both matrices A and B are shifted between neighboring processors. The well-known Cannon's algorithm belongs to this group. In the second group, matrix A remains in place and both matrices B and C are shifted. In the third group, matrix B remains in place while both matrices A and C are shifted. The obtained array processor allocations need n compute-shift steps to multiply n×n dense matrices.
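
The compute-shift pattern described in the abstract can be illustrated with a small sequential simulation of an n×n torus in which each processing element holds one scalar of each matrix. This is a minimal sketch, not the authors' formulation: the function names (torus_c_stationary, torus_a_stationary), the initial skew, and the shift directions are assumptions chosen for illustration, and the paper's scheduling vectors may correspond to other skews and directions. The first function follows the usual statement of Cannon's algorithm (C stays, A and B shift); the second shows one possible variant in which A stays while B and C shift.

```python
def torus_c_stationary(A, B, C):
    """C stays in place; A shifts left along rows, B shifts up along columns."""
    n = len(A)
    # Assumed initial skew: PE (i, j) starts with A[i][(i+j) % n] and B[(i+j) % n][j].
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [row[:] for row in C]
    for _ in range(n):  # n compute-shift steps
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]  # local multiply-add in PE (i, j)
        # Cyclic shifts between neighbors; wrap-around uses the torus links.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


def torus_a_stationary(A, B, C):
    """A stays in place; B shifts up along columns, C shifts left along rows."""
    n = len(A)
    # PE (i, k) keeps A[i][k]. Assumed initial skew: it starts with
    # B[k][(i+k) % n] and the partial sum for C[i][(i+k) % n].
    b = [[B[k][(i + k) % n] for k in range(n)] for i in range(n)]
    c = [[C[i][(i + k) % n] for k in range(n)] for i in range(n)]
    for _ in range(n):  # n compute-shift steps
        for i in range(n):
            for k in range(n):
                c[i][k] += A[i][k] * b[i][k]  # local multiply-add in PE (i, k)
        b = [[b[(i + 1) % n][k] for k in range(n)] for i in range(n)]
        c = [[c[i][(k + 1) % n] for k in range(n)] for i in range(n)]
    # After n cyclic shifts every partial sum is back at its starting PE,
    # so undoing the initial skew recovers C in natural order.
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            out[i][(i + k) % n] = c[i][k]
    return out


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Z = [[0, 0], [0, 0]]
print(torus_c_stationary(A, B, Z))  # [[19, 22], [43, 50]]
print(torus_a_stationary(A, B, Z))  # [[19, 22], [43, 50]]
```

In both functions the loop body runs exactly n times, matching the n compute-shift steps stated in the abstract. A B-stationary variant could be written symmetrically by keeping B in place and shifting A and C.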