Architecture and Implementation of a Vector/SIMD Multiply-Accumulate Unit

Authors:
Albert Danysh;Dimitri Tan
Affiliations:
IEEE;IEEE
Venue:
IEEE Transactions on Computers
Year:
2005

Abstract

This paper presents a 64-bit fixed-point vector multiply-accumulator (MAC) architecture capable of supporting multiple precisions. The vector MAC can perform one 64×64, two 32×32, four 16×16, or eight 8×8-bit signed/unsigned multiply-accumulates using essentially the same hardware as a scalar 64-bit MAC and with only a small increase in delay. The scalar MAC architecture is "vectorized" by inserting mode-dependent multiplexing into the partial product generation and by inserting mode-dependent kills in the carry chain of the reduction tree and the final carry-propagate adder. This is an example of "shared segmentation" in which the existing scalar structure is segmented and then shared between vector modes. The vector MAC is area efficient and can be fully pipelined, which makes it suitable for high-performance processors and, possibly, dynamically reconfigurable processors. The "shared segmentation" method is compared to an alternative method, referred to as the "shared subtree" method, by implementing vector MAC designs using two different technologies and three different vector widths.
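
The mode-dependent carry kills mentioned in the abstract can be illustrated with a small bit-level model. The following is a minimal sketch, not the authors' design: it models only a 64-bit ripple-carry adder whose carry is forced to zero at lane boundaries, and it does not model the partial product multiplexing or the reduction tree. The names `WIDTH`, `kill_mask`, and `segmented_add` are hypothetical and chosen for illustration.

```python
WIDTH = 64  # datapath width taken from the abstract


def kill_mask(lane_bits):
    """Return a mask of the bit positions whose carry-out is killed.

    Assumption: a kill is placed at the top bit of every lane, so
    lane_bits = 64 behaves as a scalar adder and lane_bits = 8 yields
    eight independent byte adders.
    """
    mask = 0
    for top in range(lane_bits - 1, WIDTH, lane_bits):
        mask |= 1 << top
    return mask


def segmented_add(a, b, lane_bits):
    """Add two 64-bit words bit by bit, killing carries at lane boundaries."""
    kills = kill_mask(lane_bits)
    carry = 0
    result = 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
        if (kills >> i) & 1:
            carry = 0  # mode-dependent kill: no carry crosses into the next lane
    return result
```

With the same operands, only the mode changes the result: `segmented_add(0x00FF, 0x0001, 16)` lets the carry ripple into bit 8, while `segmented_add(0x00FF, 0x0001, 8)` kills it at the byte boundary. This is the sense in which a single scalar structure can be segmented and shared between vector modes.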