In this paper, the index space of the (n×n)-matrix multiply-add problem C = C + A·B is represented as a 3D n×n×n torus. All possible time-scheduling functions that activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we map the computations onto the 2D n×n toroidal array processor. All optimal 2D data allocations that solve the problem in n multiply-add-roll steps are obtained; the well-known Cannon's algorithm is one of the resulting allocations. We use the optimal data allocations to describe all variants of the GEMM operation on the 2D toroidal array processor. By controlling the data movement, explicit transposition is avoided in 75% of the GEMM variants; the remaining 25% require only one matrix transpose. Finally, we describe four versions of the GEMM operation covering the possible layouts of the data initially loaded into the array processor.
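The abstract cites Cannon's algorithm as one of the optimal allocations that completes C = C + A·B in n multiply-add-roll steps. As an illustrative sketch only (not the paper's implementation), the algorithm can be simulated in plain Python on an n×n grid of scalar "processors", with the initial skew followed by n multiply-add-roll steps:

```python
def cannon_matmul(A, B):
    """Simulate Cannon's algorithm for C = A * B on an n x n toroidal
    array where processor (i, j) holds one element of A, B, and C."""
    n = len(A)
    # Initial alignment (skew): row i of A rolls left by i positions,
    # column j of B rolls up by j positions, so processor (i, j)
    # starts with A[i][(i+j) % n] and B[(i+j) % n][j].
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):  # n multiply-add-roll steps
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Roll: A moves one step left along each row (torus wrap),
        # B moves one step up along each column.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

After the skew, processor (i, j) always holds a matching pair A[i][k], B[k][j] with k = (i + j + step) mod n, so every processor performs useful work on every step and the product completes in exactly n rolls, matching the multiply-add-roll count stated in the abstract.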