IEEE Transactions on Computers
In this paper, we propose a class of division algorithms that reduce the delay of quotient-digit selection by introducing more concurrency and flexibility into its computation. From this class, we select one algorithm that moves part of the selection function out of the critical path. This shortens the critical path relative to existing alternatives. We present the algorithm and describe architectures for radix 4 and radix 16. For radix 16, we use the scheme of overlapping two radix-4 stages. For both radices, we show that the algorithm yields units with well-balanced critical paths and, consequently, shorter cycle times. In the radix-16 case, we also apply additional speculation techniques. To estimate the speedup, we used a rough timing model based on logical effort. For both radices, we estimate a speedup of about 25 percent with respect to previous implementations. In the radix-4 case, this speedup comes at roughly the same area. In the radix-16 case, the area increases by about 30 percent. We verified our estimates by synthesizing the radix-4 units.
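The Python sketch below illustrates the generic digit-recurrence scheme that the abstract builds on. It is not the authors' proposed algorithm. It models a radix-4 recurrence with digit set {-2, ..., 2} and a radix-16 variant that performs two radix-4 stages per iteration. The sketch rests on several assumptions. Operands are normalized to [1/2, 1). Arithmetic is exact, using `fractions.Fraction`. Digit selection compares the full shifted residual against the thresholds (k - 1/2)d. A real unit would instead compare short truncated estimates of a redundant (for example, carry-save) residual. The function names are hypothetical.

```python
from fractions import Fraction

# Digit set {-2..2} in radix 4 gives redundancy factor rho = a/(r-1) = 2/3,
# so the recurrence must keep |w| <= (2/3) * d (containment condition).
RHO = Fraction(2, 3)


def select_radix4(v, d):
    """Pick q in {-2..2} by comparing v = 4*w against multiples of d.

    Illustrative thresholds (k - 1/2)*d; a hardware unit would use a
    selection table indexed by truncated estimates of v and d instead.
    """
    for k in (2, 1, 0, -1):
        if v >= (k - Fraction(1, 2)) * d:
            return k
    return -2


def radix4_divide(x, d, steps):
    """Return (digits, quotient) with quotient ~ x/d, for x, d in [1/2, 1)."""
    w = x / 4                      # w[0] = x/4 guarantees |w[0]| <= rho*d
    digits = []
    for _ in range(steps):
        v = 4 * w                  # shift the residual by the radix
        q = select_radix4(v, d)    # quotient-digit selection
        w = v - q * d              # w[j+1] = 4*w[j] - q[j+1]*d
        assert abs(w) <= RHO * d   # residual stays bounded
        digits.append(q)
    # x/d = 4 * sum(q_j * 4^-j) + 4^(1-n) * w[n]/d, so the error is <= (2/3)*4^(1-n)
    quotient = 4 * sum(Fraction(q, 4 ** (j + 1)) for j, q in enumerate(digits))
    return digits, quotient


def radix16_divide(x, d, steps):
    """Radix 16 as two radix-4 stages per iteration, modelled sequentially."""
    w = x / 4
    digits = []
    for _ in range(steps):
        q_hi = select_radix4(4 * w, d)
        w_mid = 4 * w - q_hi * d
        q_lo = select_radix4(4 * w_mid, d)
        w = 4 * w_mid - q_lo * d          # equals 16*w - (4*q_hi + q_lo)*d
        digits.append(4 * q_hi + q_lo)    # radix-16 digit in {-10..10}
    quotient = 4 * sum(Fraction(q, 16 ** (j + 1)) for j, q in enumerate(digits))
    return digits, quotient


if __name__ == "__main__":
    x, d = Fraction(3, 4), Fraction(5, 8)     # x/d = 1.2
    _, q4 = radix4_divide(x, d, 10)
    _, q16 = radix16_divide(x, d, 5)
    print(float(q4), float(q16))
```

Five radix-16 iterations reproduce the ten radix-4 digits pairwise, so both calls return the same quotient. This model computes the second stage only after the first digit is known. Overlapping the stages in hardware means starting the second selection before the first digit is resolved, for example by evaluating it for each candidate value of that digit. The sketch checks only the arithmetic, not that timing behavior.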