We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-extension solution, implements all other functionality. Place-and-route evaluations using a 65-nm 1.1-V cell library show that the proposed architecture offers a 31% improvement in speed and a 32% reduction in energy per operation, averaged across operand sizes of 16, 32, 48, and 64 bits, over a reference two-cycle MAC architecture that employs a multiplier in the first stage and an accumulator in the second. When the proposed architecture is operated at the lower frequency of the reference architecture, the available timing slack can be used to downsize gates, resulting in a 52% reduction in energy compared to the reference. We extend the new architecture to create a versatile double-throughput MAC (DTMAC) unit that efficiently performs either multiply-accumulate or multiply operations for N-bit, 1 × N/2-bit, or 2 × N/2-bit operands. In comparison to a fixed-function 32-bit MAC unit, 16-bit multiply-accumulate operations can be executed with 67% higher energy efficiency on a 32-bit DTMAC unit.
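The arithmetic behavior described above (two's complement operands, an accumulator widened by guard bits, saturation on overflow, and a mode that runs two N/2-bit operations at once) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the paper's circuit: the function names, the default of 8 guard bits, the choice to saturate at the full accumulator width, and the packing of two half-width operands into one N-bit word are all hypothetical choices made for illustration, and nothing here models the pipeline stages, the reduction tree, or the sign-extension scheme.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value: int, bits: int) -> int:
    """Clamp `value` to the range of a signed `bits`-bit register."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac(acc: int, a: int, b: int, n: int = 16, guard: int = 8) -> int:
    """Return acc + a*b for n-bit operands.

    Assumption: the accumulator is 2n + guard bits wide and saturates
    at that width; the guard-bit count is an illustrative default.
    """
    product = to_signed(a, n) * to_signed(b, n)
    return saturate(acc + product, 2 * n + guard)


def dual_mac(acc_hi: int, acc_lo: int, a: int, b: int,
             n: int = 32, guard: int = 8) -> tuple[int, int]:
    """Sketch of a 2 x N/2 mode: two independent half-width MACs.

    Assumption: each n-bit word packs two n/2-bit operands, upper
    half and lower half, and each half has its own accumulator.
    """
    h = n // 2
    mask = (1 << h) - 1
    new_hi = mac(acc_hi, a >> h, b >> h, h, guard)      # upper halves
    new_lo = mac(acc_lo, a & mask, b & mask, h, guard)  # lower halves
    return new_hi, new_lo
```

For example, `mac(0, 3, 0xFFFE)` treats `0xFFFE` as -2 in 16 bits and returns -6, while `mac(120, 7, 7, n=4, guard=0)` exceeds the 8-bit accumulator range and saturates to 127. In the packed mode, `dual_mac(0, 0, 0x0003FFFE, 0x00020005)` computes 3 × 2 and -2 × 5 independently.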