Design Issues in Floating-Point Division
Floating-point support has become a mandatory feature of new microprocessors. Over the past few years, the leading architectures have gone through several generations of floating-point units (FPUs). While addition and multiplication implementations have become increasingly efficient, division and square root support has remained uneven: there is considerable variation in the types of algorithms employed, as well as in the quality and performance of the implementations. This situation stems from skepticism about the importance of division and square root, and from an insufficient understanding of the design alternatives. The purpose of this paper is to clarify and evaluate the implementation tradeoffs at the FPU level, thus enabling designers to make informed decisions.

Division and square root have long been treated as minor, bothersome members of the floating-point family. Microprocessor designers frequently perceive them as infrequent, low-priority operations, barely worth the trouble of implementing; design effort and chip resources are allocated accordingly. The survey of microprocessor FPU performance in Table 1 shows some of the uneven results of this philosophy: while multiplication requires from 2 to 5 machine cycles, division latencies range from 9 to 60, and the variation is even greater for square root, which in several cases is not supported in hardware at all. These figures hint at, but mostly conceal, the significant variation in algorithms and topologies among the different implementations.

The error in the Intel Pentium floating-point unit, with the accompanying publicity and $475 million write-off, illustrates some of the hazards of an incorrect division implementation. But correctness is not enough; low performance causes enough problems of its own. Even though divide and square root are relatively infrequent operations in most applications, they are indispensable, particularly in many scientific programs.
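Much of the algorithmic variation mentioned above comes down to two families: digit-recurrence (subtractive) methods such as restoring and SRT division, and multiplicative methods such as Newton-Raphson. The sketch below is illustrative only, not code from the paper; the function names and the crude power-of-two seed are my own, standing in for the seed lookup table a real FPU would use.

```python
# Illustrative sketch, not code from the paper: the two main algorithm
# families for division. Digit-recurrence methods (restoring, SRT) retire
# a fixed number of quotient bits per cycle; multiplicative methods
# (Newton-Raphson, Goldschmidt) roughly double the correct bits per step.

def newton_raphson_divide(a, b, iterations=5):
    """Approximate a/b (b > 0) via Newton-Raphson on the reciprocal.

    x_{k+1} = x_k * (2 - b * x_k) converges quadratically to 1/b from
    any seed with 0 < b*x_0 < 2. A real FPU reads the seed from a small
    lookup table; here a crude power-of-two estimate suffices.
    """
    x = 2.0 ** -int(b).bit_length() if b >= 1 else 1.0  # crude seed
    for _ in range(iterations):
        x = x * (2.0 - b * x)       # relative error squares each step
    return a * x                     # one final multiply by the dividend

def restoring_divide(a, b, bits=24):
    """Integer restoring division: exactly one quotient bit per step."""
    q, r = 0, a
    for i in range(bits - 1, -1, -1):
        if r >= b << i:              # does the trial subtraction succeed?
            r -= b << i
            q |= 1 << i
    return q, r
```

The quadratic-versus-linear convergence contrast is one root of the tradeoffs the paper examines: multiplicative schemes need few iterations but typically share the FPU's multiplier, while digit-recurrence units are self-contained but retire only a few bits per cycle.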
Compiler optimizations tend to increase the frequency of these operations, and poor implementations disproportionately penalize any code that uses them at all. Furthermore, as the latency gap grows between addition and multiplication on the one hand and divide/square root on the other, the latter increasingly become performance bottlenecks. Programmers have tried to work around this problem by rewriting algorithms to avoid divide and square-root operations, but the resulting code generally suffers from poor numerical properties, such as instability or overflow. In short, division and square root are natural components of many algorithms, and those algorithms are best served by implementations with good performance.

Quantifying what constitutes "good performance" is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications. Even if one accepts this doctrine at face value, implementing division--and square root--involves much more than relative latencies: area, throughput, complexity, and the interaction with other operations must be considered as well. This article explores the various tradeoffs involved and illuminates the consequences of different design choices.
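The arithmetic behind such rules of thumb, and the rounding hazard of the divide-avoiding rewrite, can be sketched in a few lines. The operation mix below is an assumed, illustrative workload, not a measurement from the paper.

```python
# Back-of-the-envelope arithmetic for divide-latency rules of thumb.
# The frequencies and cycle counts are illustrative assumptions.

def avg_fp_latency(mix, lat):
    """Frequency-weighted average latency per floating-point operation."""
    return sum(mix[op] * lat[op] for op in mix)

mix = {"add": 0.55, "mul": 0.40, "div": 0.05}   # assumed operation mix
fast = {"add": 3, "mul": 3, "div": 9}           # divide = 3x multiply
slow = {"add": 3, "mul": 3, "div": 60}          # a slow iterative divider

# Even at a 5% divide frequency, stretching divide latency from 9 to 60
# cycles nearly doubles the average latency per FP operation.
fast_avg = avg_fp_latency(mix, fast)
slow_avg = avg_fp_latency(mix, slow)

# The workaround of computing a * (1/b) instead of a / b adds a second
# rounding step, so the product can be off by an ulp:
assert (1.0 / 49.0) * 49.0 != 1.0
```

The final assertion is the classic double-precision example of why reciprocal-multiply rewrites lose the correct rounding that a true IEEE division provides.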