On the Precision Attainable with Various Floating-Point Number Systems

Authors:
Richard P. Brent
Affiliations:
Mathematical Sciences Department, IBM T. J. Watson Research Center, Yorktown Heights, N.Y. 10598; Computer Centre, Australian National University, Canberra, A.C.T., Australia.
Venue:
IEEE Transactions on Computers
Year:
1973

Abstract

For scientific computations on a digital computer the set of real numbers is usually approximated by a finite set F of "floating-point" numbers. We compare the numerical accuracy possible with different choices of F having approximately the same range and requiring the same word length. In particular, we compare different choices of base (or radix) in the usual floating-point systems. The emphasis is on the choice of F, not on the details of the number representation or the arithmetic, but both rounded and truncated arithmetic are considered. Theoretical results are given, and some simulations of typical floating-point computations (forming sums, solving systems of linear equations, finding eigenvalues) are described. If the leading fraction bit of a normalized base-2 number is not stored explicitly (saving a bit), and the criterion is to minimize the mean square roundoff error, then base 2 is best. If unnormalized numbers are allowed, so the first bit must be stored explicitly, then base 4 (or sometimes base 8) is the best of the usual systems.
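
The base comparison described in the abstract can be illustrated with a small numerical experiment. The sketch below is not the paper's simulation. It is a minimal Python illustration under assumptions that are not taken from the source: operands are log-uniformly distributed, rounding is to nearest, and fraction widths are chosen so that every format has roughly the same word length and range (base 4 gets one extra fraction bit and base 16 two, since their exponents need fewer bits for the same range, and hidden-bit base 2 gets one extra effective bit). The names `round_to_format`, `rms_relative_error`, and `compare_formats`, and the 12-bit baseline, are hypothetical.

```python
import math
import random


def round_to_format(x, log2_base, fraction_bits):
    """Round positive x to the nearest m * base**e with 1/base <= m <= 1,
    where base = 2**log2_base and m has `fraction_bits` bits after the point."""
    k = log2_base
    _, e2 = math.frexp(x)            # 2**(e2 - 1) <= x < 2**e2
    e = -(-e2 // k)                  # ceil(e2 / k): exponent in base 2**k
    scale = fraction_bits - k * e    # x * 2**scale is the fraction in ulps
    q = round(math.ldexp(x, scale))  # round to nearest integer number of ulps
    return math.ldexp(q, -scale)


def rms_relative_error(log2_base, fraction_bits, samples=20000, seed=0):
    """Estimate the root-mean-square relative rounding error.

    Assumption: operands are log-uniform over 2**0 .. 2**12, a span that
    covers a whole number of exponent steps for bases 2, 4, 8, and 16.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = 2.0 ** rng.uniform(0.0, 12.0)
        rel = (round_to_format(x, log2_base, fraction_bits) - x) / x
        total += rel * rel
    return math.sqrt(total / samples)


def compare_formats(baseline_bits=12):
    """Compare formats under the assumed equal-word-length bit budgets."""
    p = baseline_bits
    formats = {
        "base 2, explicit bit": (1, p),
        "base 2, hidden bit": (1, p + 1),  # hidden bit frees one fraction bit
        "base 4": (2, p + 1),              # exponent needs one bit fewer
        "base 16": (4, p + 2),             # exponent needs two bits fewer
    }
    return {name: rms_relative_error(k, bits)
            for name, (k, bits) in formats.items()}


if __name__ == "__main__":
    for name, err in sorted(compare_formats().items(), key=lambda kv: kv[1]):
        print(f"{name:22s} rms relative error = {err:.3e}")
```

Under these assumptions the ordering agrees with the abstract's conclusion: hidden-bit base 2 gives the smallest root-mean-square relative error, and base 4 is the best of the formats that store the leading bit explicitly. Base 8 is omitted because its exponent saving is not a whole number of bits in this simplified budget.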