Synthesis of an Optimal Family of Matrix Multiplication Algorithms on Linear Arrays
IEEE Transactions on Computers
The authors describe a family of linear systolic arrays for matrix multiplication that exhibits a tradeoff between local storage and the number of processing elements (PEs). The design consists of processors connected in a linear array, each with local storage s, 1 ≤ s ≤ n, for n×n matrix multiplication; the number of processors is n⌈n/s⌉, i.e., n times the least integer ≥ n/s. The input matrices are fed as two-speed data streams, using fast and slow channels to satisfy the dependencies of the standard matrix multiplication algorithm. While families of linear arrays have been synthesized for this problem before, this technique leads to simpler designs with fewer processors and improved input-to-output delay. All of these designs use the optimal number of processors for local storage in the range 1 ≤ s ≤ n. The data flow is unidirectional, which makes the designs implementable under fault models of wafer-scale integration.
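A minimal sketch of the storage/PE tradeoff described above, under assumptions not spelled out in the abstract: the `pe_count` helper encodes the stated processor count n⌈n/s⌉, and `linear_array_matmul` is a hypothetical functional simulation of the s = n endpoint (each PE preloads one column of B into its local store while rows of A stream through). The paper's actual two-speed channel schedule and pipelined timing are abstracted away.

```python
import math

def pe_count(n, s):
    """Number of PEs for n x n matmul with local storage s per PE.

    Matches the abstract's formula: n times the least integer >= n/s.
    """
    return n * math.ceil(n / s)

def linear_array_matmul(A, B):
    """Functional simulation of a linear array at the s = n design point.

    Hypothetical schedule (not necessarily the paper's): PE j holds
    column j of B in its local storage (s = n words); rows of A stream
    unidirectionally through the array, and PE j accumulates
    C[i][j] = sum_k A[i][k] * B[k][j].
    """
    n = len(A)
    # Local storage of each PE: one column of B.
    pe_store = [[B[k][j] for k in range(n)] for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for i in range(n):              # row i of A streams past every PE
        for j, col in enumerate(pe_store):
            acc = 0
            for k in range(n):      # one multiply-accumulate per word seen
                acc += A[i][k] * col[k]
            C[i][j] = acc
    return C
```

With s = n the formula gives n PEs, and with s = 1 it gives n² PEs, illustrating the two extremes of the storage/processor tradeoff.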