Design, implementation and testing of extended and mixed precision BLAS

Authors:
Xiaoye S. Li;James W. Demmel;David H. Bailey;Greg Henry;Yozo Hida;Jimmy Iskandar;William Kahan;Suh Y. Kang;Anil Kapur;Michael C. Martin;Brandon J. Thompson;Teresa Tung;Daniel J. Yoo
Affiliations:
Lawrence Berkeley National Laboratory;University of California, Berkeley, CA;Lawrence Berkeley National Laboratory;Intel Corporation, Hillsboro, OR;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA;University of California, Berkeley, CA
Venue:
ACM Transactions on Mathematical Software (TOMS)
Year:
2002



Abstract

This article describes the design rationale, a C implementation, and conformance testing of a subset of the new Standard for the BLAS (Basic Linear Algebra Subroutines): Extended and Mixed Precision BLAS. Permitting higher internal precision and mixed input/output types and precisions allows us to implement some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. The new BLAS are challenging to implement and test because there are many more subroutines than in the existing Standard, and because we must be able to assess whether a higher precision is used for internal computations than is used for either input or output variables. We have therefore developed an automated process of generating and systematically testing these routines. Our methodology is applicable to languages besides C. In particular, our algorithms used in the testing code will be valuable to all other BLAS implementors. Our extra precision routines achieve excellent performance, close to half of the machine peak Megaflop rate even for the Level 2 BLAS, when the data access is stride one.
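
To make the idea of higher internal precision concrete, the following is a minimal sketch in C, not the article's implementation. The function names `dot_working` and `dot_extra` are hypothetical and are not routines from the Standard. Both take single precision inputs and return a single precision result; they differ only in the precision of the internal accumulator. The sketch assumes IEEE 754 `float` and `double` arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a routine from the BLAS Standard):
   dot product of two float vectors, accumulated in float. */
float dot_working(size_t n, const float *x, const float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    /* volatile forces the running sum to be rounded to float at every
       step, even on hardware that keeps wider intermediate results. */
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc = acc + x[i] * y[i];
    }
    return acc;
}

/* Hypothetical helper: same float inputs and float output, but the
   products and the running sum are carried in double internally.
   Only the final result is rounded back to float. */
float dot_extra(size_t n, const float *x, const float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += (double)x[i] * (double)y[i];
    }
    return (float)acc;
}
```

An input with heavy cancellation separates the two versions, which hints at how a test can detect whether extra internal precision is really used. For `x = {1.0e8f, 1.0f, -1.0e8f}` and `y = {1.0f, 1.0f, 1.0f}`, the exact dot product is 1. `dot_working` returns 0 because the 1 is lost when added to 1.0e8 in single precision, while `dot_extra` returns 1.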