Area and performance tradeoffs in floating-point divide and square-root implementations

  • Authors:
  • Peter Soderquist; Miriam Leeser

  • Affiliations:
  • Cornell University, Ithaca, NY; Northeastern University, Boston, MA

  • Venue:
  • ACM Computing Surveys (CSUR)
  • Year:
  • 1996

Abstract

Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations: the latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often lower still. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the two primary techniques employed in microprocessor floating-point units, subtractive and multiplicative methods, along with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square-root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low area requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints and modest performance on divide and square-root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.
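
To make the multiplicative approach concrete, here is a minimal Python sketch (not from the paper) of Newton-Raphson reciprocal iteration; the function name, seed value, and step count are illustrative assumptions. To compute a/b, the method refines an approximation x of 1/b and then forms a*x; each refinement step costs two dependent multiplications and a subtraction, which is why a hardware implementation funnels divide (and, via a similar recurrence, square root) through the multiply pipeline.

    # Sketch of Newton-Raphson reciprocal iteration (illustrative, not the
    # paper's implementation). The recurrence x_{k+1} = x_k * (2 - b * x_k)
    # converges quadratically to 1/b: each step roughly doubles the number
    # of correct bits. In hardware the seed usually comes from a small
    # lookup table; here a crude hand-picked constant stands in for it.

    def newton_raphson_divide(a: float, b: float, x0: float, steps: int = 5) -> float:
        """Approximate a / b from a rough seed x0 for 1/b."""
        x = x0
        for _ in range(steps):
            x = x * (2.0 - b * x)  # two multiplies + one subtract per step
        return a * x

    # Usage: 355/113 (a classic approximation of pi), seeded near 1/113.
    print(newton_raphson_divide(355.0, 113.0, x0=0.008))  # ~3.1415929

Goldschmidt's algorithm performs essentially the same multiplications but makes the two per-step products independent of each other, so they can be overlapped in a pipelined multiplier; the price is that, unlike Newton-Raphson, the iteration is not self-correcting, so intermediate rounding errors must be controlled with extra working precision.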