Division and Square Root: Choosing the Right Implementation

  • Authors:
  • Peter Soderquist; Miriam Leeser

  • Venue:
  • IEEE Micro
  • Year:
  • 1997

Abstract

Floating-point support has become a mandatory feature of new microprocessors. Over the past few years, the leading architectures have seen several generations of floating-point units (FPUs). While addition and multiplication implementations have become increasingly efficient, division and square root support has remained uneven. There is considerable variation in the types of algorithms employed, as well as in the quality and performance of the implementations. This situation originates in skepticism about the importance of division and square root and in an insufficient understanding of the design alternatives. The purpose of this paper is to clarify and evaluate the implementation tradeoffs at the FPU level, thus enabling designers to make informed decisions.

Division and square root have long been considered minor, bothersome members of the floating-point family. Microprocessor designers frequently perceive them as infrequent, low-priority operations, barely worth the trouble of implementing; design effort and chip resources are allocated accordingly. The survey of microprocessor FPU performance in Table 1 shows some of the uneven results of this philosophy. While multiplication requires from 2 to 5 machine cycles, division latencies range from 9 to 60. The variation is even greater for square root, which in several cases is not supported in hardware at all. This data hints at, but mostly conceals, the significant variation in algorithms and topologies among the different implementations.

The error in the Intel Pentium floating-point unit, with its accompanying publicity and $475 million write-off, illustrates some of the hazards of an incorrect division implementation. But correctness is not enough; low performance causes enough problems of its own. Even though divide and square root are relatively infrequent operations in most applications, they are indispensable, particularly in many scientific programs. Compiler optimizations tend to increase the frequency of these operations, and poor implementations disproportionately penalize code that uses them at all. Furthermore, as the latency gap grows between addition and multiplication on the one hand and divide/square root on the other, the latter increasingly become performance bottlenecks. Programmers have attempted to get around this problem by rewriting algorithms to avoid divide/square root operations, but the resulting code generally suffers from poor numerical properties, such as instability or overflow. In short, division and square root are natural components of many algorithms, which are best served by implementations with good performance.

Quantifying what constitutes "good performance" is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications. Even if one accepts this doctrine at face value, implementing division and square root involves much more than relative latencies. Area, throughput, complexity, and the interaction with other operations must be considered as well. This article explores the various tradeoffs involved and illuminates the consequences of different design choices.
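
The algorithmic variation the abstract alludes to spans two broad families: subtractive digit-recurrence schemes (such as SRT, the algorithm class implicated in the Pentium bug) and multiplicative schemes that converge on the quotient through repeated multiplication. The C sketch below illustrates the multiplicative family via Newton-Raphson reciprocal iteration; the seed constants and iteration count are textbook values for double precision, not a description of any particular FPU in the survey.

```c
#include <math.h>
#include <stdio.h>

/* Sketch: division via Newton-Raphson reciprocal iteration,
 * x_{n+1} = x_n * (2 - m * x_n), which converges to 1/m.
 * Assumes b is positive, finite, and nonzero. */
static double nr_divide(double a, double b)
{
    int e;
    double m = frexp(b, &e);   /* b = m * 2^e with m in [0.5, 1) */

    /* Linear seed accurate to ~4 bits on [0.5, 1); a hardware unit
     * would typically read the seed from a small lookup table. */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * m;

    /* Quadratic convergence roughly doubles the correct bits each
     * pass (4 -> 8 -> 16 -> 32 -> 64), so four iterations cover the
     * 53-bit double-precision significand. */
    for (int i = 0; i < 4; i++)
        x = x * (2.0 - m * x);

    return ldexp(a * x, -e);   /* a/b = a * (1/m) * 2^(-e) */
}

int main(void)
{
    printf("nr_divide: %.17g\nlibrary:   %.17g\n",
           nr_divide(355.0, 113.0), 355.0 / 113.0);
    return 0;
}
```

The design tension the article examines is visible even here: each iteration costs a handful of multiply latencies on shared hardware, whereas a digit-recurrence divider produces a fixed number of quotient bits per cycle on dedicated hardware, trading area for latency and throughput.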
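The abstract's warning about rewriting algorithms to dodge these operations is easy to reproduce. The toy example below, a hypothetical point-in-circle test, avoids a square root by comparing squared magnitudes; for large inputs the squares overflow and the "optimized" test silently gives the wrong answer.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Deliberately large but representable values. */
    double x = 1e200, y = 0.0, r = 2e200;

    /* Square-root-free version: x*x and r*r both overflow to
     * infinity, and "inf < inf" is false, so the test fails. */
    int no_sqrt = (x * x + y * y < r * r);

    /* Version that keeps the square root (hypot also sidesteps the
     * intermediate overflow): correct for the same inputs. */
    int with_sqrt = (hypot(x, y) < r);

    printf("sqrt-free: %d, with sqrt: %d\n", no_sqrt, with_sqrt);
    return 0;   /* prints: sqrt-free: 0, with sqrt: 1 */
}
```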
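The three-times rule of thumb can likewise be made concrete with a back-of-the-envelope latency budget. The operation mix below is an assumption for illustration (ten multiplies per divide); the latencies come from the figures quoted above: a 3-cycle multiplier, and divide latencies of 9 (the three-times ratio) and 60 (the slowest case in Table 1).

```latex
% Division's share of total arithmetic latency, assuming 10 multiplies
% per divide at L_mul = 3 cycles:
\[
  \frac{L_{\mathrm{div}}}{10\,L_{\mathrm{mul}} + L_{\mathrm{div}}},
  \qquad
  \frac{9}{30 + 9} \approx 23\%,
  \qquad
  \frac{60}{30 + 60} \approx 67\%.
\]
```

At the recommended ratio, division remains a modest fraction of arithmetic time; at the 60-cycle extreme, a single infrequent operation accounts for two-thirds of it, which is precisely the bottleneck effect the article describes.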