Synthesis of an Optimal Family of Matrix Multiplication Algorithms on Linear Arrays
IEEE Transactions on Computers
The authors describe a family of linear systolic arrays for matrix multiplication that exhibits a tradeoff between local storage and the number of processing elements (PEs). The design consists of processors connected in a linear array, each with local storage s, 1 ≤ s ≤ n, for n×n matrix multiplication; the number of processors is n⌈n/s⌉, i.e., n times the least integer ≥ n/s. The input matrices are fed as two-speed data streams, using fast and slow channels to satisfy the dependencies of the standard matrix multiplication algorithm. While families of linear arrays have been synthesized for this problem before, this technique leads to simpler designs with fewer processors and improved input-to-output delay. All of these designs use the optimal number of processors for local storage in the range 1 ≤ s ≤ n. The data flow is unidirectional, which makes the designs implementable under fault models of wafer-scale integration.
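A minimal sketch of the storage/PE tradeoff described above, under assumptions not spelled out in the abstract: the `pe_count` helper encodes the stated processor count n⌈n/s⌉, and `linear_array_matmul` is a hypothetical functional simulation of the s = n endpoint (each PE preloads one column of B into its local store while rows of A stream through). The paper's actual two-speed channel schedule and pipelined timing are abstracted away.

```python
import math

def pe_count(n, s):
    """Number of PEs for n x n matmul with local storage s per PE.

    Matches the abstract's formula: n times the least integer >= n/s.
    """
    return n * math.ceil(n / s)

def linear_array_matmul(A, B):
    """Functional simulation of a linear array at the s = n design point.

    Hypothetical schedule (not necessarily the paper's): PE j holds
    column j of B in its local storage (s = n words); rows of A stream
    unidirectionally through the array, and PE j accumulates
    C[i][j] = sum_k A[i][k] * B[k][j].
    """
    n = len(A)
    # Local storage of each PE: one column of B.
    pe_store = [[B[k][j] for k in range(n)] for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for i in range(n):              # row i of A streams past every PE
        for j, col in enumerate(pe_store):
            acc = 0
            for k in range(n):      # one multiply-accumulate per word seen
                acc += A[i][k] * col[k]
            C[i][j] = acc
    return C
```

With s = n the formula gives n PEs, and with s = 1 it gives n² PEs, illustrating the two extremes of the storage/processor tradeoff.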