In this paper, the index space of the (n×n)-matrix multiply-add problem C = C + A·B is represented as a 3D n×n×n torus. All possible time-scheduling functions that activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we map the computations onto the 2D n×n toroidal array processor. All optimal 2D data allocations that solve the problem in n multiply-add-roll steps are obtained; the well-known Cannon's algorithm is one of the resulting allocations. We use the optimal data allocations to describe all variants of the GEMM operation on the 2D toroidal array processor. By controlling the data movement, explicit transposition is avoided in 75% of the GEMM variants; the remaining 25% require only one matrix transpose. Finally, we describe four versions of the GEMM operation covering the possible layouts of the data initially loaded into the array processor.
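The abstract cites Cannon's algorithm as one of the optimal allocations that completes C = C + A·B in n multiply-add-roll steps. As an illustrative sketch only (not the paper's implementation), the algorithm can be simulated in plain Python on an n×n grid of scalar "processors", with the initial skew followed by n multiply-add-roll steps:

```python
def cannon_matmul(A, B):
    """Simulate Cannon's algorithm for C = A * B on an n x n toroidal
    array where processor (i, j) holds one element of A, B, and C."""
    n = len(A)
    # Initial alignment (skew): row i of A rolls left by i positions,
    # column j of B rolls up by j positions, so processor (i, j)
    # starts with A[i][(i+j) % n] and B[(i+j) % n][j].
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):  # n multiply-add-roll steps
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Roll: A moves one step left along each row (torus wrap),
        # B moves one step up along each column.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

After the skew, processor (i, j) always holds a matching pair A[i][k], B[k][j] with k = (i + j + step) mod n, so every processor performs useful work on every step and the product completes in exactly n rolls, matching the multiply-add-roll count stated in the abstract.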