A novel reconfigurable low-power, high-performance matrix multiplier architecture and its component circuits are presented. The processor can be easily reconfigured to compute the product of matrices X(n x k) and Y(k x m) for any integers n, k, m and any item precision b (ranging from 4 to 64 bits), thus maximizing the utilization of the available hardware.

As a typical example, the hardware equivalent to one 64 x 64 bit high-precision multiplier in the system can be directly reconfigured to produce the product of two matrices X(8x8) and Y(8x8) of 8-bit items in 9 pipeline cycles. The same product would require 512 multiplications (done by large multipliers) in a non-reconfigurable high-precision system.

Given an input stream of h x h matrix pairs with b-bit items, the processor, called a matrix multiplier of size s (note s = hb), may consist of an array of (s/m)^2 small m x m multipliers (the m = 4 case is illustrated), a few arrays of adders each adding three numbers, an array of accumulators, and corresponding simple reconfiguration switches. To compute the product of X(n x k) and Y(k x m) of item precision b on the proposed processor of size s, we only need to partition X and Y into (s/b) x (s/b) sub-matrices, reconfigure the processor according to the values of s (fixed) and b (input parameter), compute the products of the sub-matrices, and accumulate them for the desired result in pipelined fashion.

A recently proposed shift switch logic, a non-binary logic for arithmetic circuits, is utilized in the design. The novel logic operates on 4-bit state signals in which no more than half of the signal bits are subject to value change at any logic stage. As verified by SPICE simulation, this significantly reduces the power dissipation of the large circuit while maintaining high speed and small VLSI area.
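The partition-and-accumulate scheme described in the abstract can be illustrated in software. The following is a minimal sketch, not the authors' hardware design: it models only the arithmetic of splitting the operands into (s/b) x (s/b) sub-matrices and accumulating the sub-matrix products. The function names `block_size` and `blocked_matmul` are hypothetical, and the sketch assumes that b divides s. The pipelining, the reconfiguration switches, and the shift switch logic are not modelled.

```python
def block_size(s, b):
    """Sub-matrix dimension h = s / b for processor size s and item precision b.

    Assumption of this sketch: b divides s exactly.
    """
    if s % b != 0:
        raise ValueError("this sketch assumes b divides s")
    return s // b


def blocked_matmul(X, Y, s=64, b=8):
    """Multiply X (n x k) by Y (k x m), one h x h sub-matrix pair at a time."""
    h = block_size(s, b)
    n, k, m = len(X), len(Y), len(Y[0])
    if any(len(row) != k for row in X):
        raise ValueError("inner dimensions of X and Y do not match")

    Z = [[0] * m for _ in range(n)]
    # Each (i0, j0, p0) triple selects one pair of sub-matrices; their
    # product is accumulated into the matching block of the result.
    for i0 in range(0, n, h):
        for j0 in range(0, m, h):
            for p0 in range(0, k, h):
                for i in range(i0, min(i0 + h, n)):
                    for j in range(j0, min(j0 + h, m)):
                        acc = Z[i][j]
                        for p in range(p0, min(p0 + h, k)):
                            acc += X[i][p] * Y[p][j]
                        Z[i][j] = acc
    return Z
```

With s = 64 and b = 8 the block dimension is h = 8, and a single 8 x 8 sub-matrix product involves 8^3 = 512 scalar multiplications, which matches the count quoted in the abstract.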