Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems)

Authors:
Julie Langou;Julien Langou;Piotr Luszczek;Jakub Kurzak;Alfredo Buttari;Jack Dongarra
Affiliations:
University of Tennessee, Knoxville TN;University of Tennessee, Knoxville TN;University of Tennessee, Knoxville TN;University of Tennessee, Knoxville TN;University of Tennessee, Knoxville TN;University of Tennessee, Knoxville TN
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006

Abstract

Recent versions of microprocessors exhibit performance characteristics for 32 bit floating point arithmetic (single precision) that are substantially higher than those for 64 bit floating point arithmetic (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures and IBM's Cell Broadband Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell than in double precision. The performance gains on these architectures come from extensions to the basic architecture, such as SSE2 in the case of the Pentium and the vector functions on the IBM Cell. The motivation for this paper is to exploit single precision operations whenever possible and to resort to double precision only at critical stages, while still attempting to deliver results with full double precision accuracy. The results described here are fairly general and can be applied to various problems in linear algebra, such as solving large sparse systems with direct or iterative methods, and to some eigenvalue problems. There are limitations to the success of this process, such as when the condition number of the problem exceeds the reciprocal of the accuracy of the single precision computations. In that case the double precision algorithm should be used.
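The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common arrangement of mixed precision iterative refinement for a dense system Ax = b: the LU factorization and the triangular solves are done in 32 bit arithmetic, while the residual and the solution update are done in 64 bit arithmetic. All names here (`f32`, `lu_factor_f32`, `lu_solve_f32`, `solve_mixed`) are hypothetical, as are the tolerance and the stopping test. Single precision is emulated by rounding every result through `struct`, so the sketch illustrates the numerics only and says nothing about speed.

```python
import struct

def f32(x):
    """Round a 64 bit Python float to the nearest 32 bit float (emulation)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(A):
    """LU factorization with partial pivoting, every result rounded to 32 bits."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ValueError("matrix is singular in single precision")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_f32(LU, perm, b):
    """Solve with the stored factors, again rounding every result to 32 bits."""
    n = len(LU)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                # forward substitution (unit lower L)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):      # back substitution (upper U)
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y

def solve_mixed(A, b, tol=1e-12, max_iter=30):
    """Refine a 32 bit solution using 64 bit residuals (assumed stopping test)."""
    LU, perm = lu_factor_f32(A)       # the expensive O(n^3) step, in 32 bit
    x = lu_solve_f32(LU, perm, b)
    b_norm = max(abs(v) for v in b)
    for it in range(max_iter):
        # Residual r = b - A x in 64 bit arithmetic (the "critical stage").
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        if max(abs(v) for v in r) <= tol * b_norm:
            return x, it
        d = lu_solve_f32(LU, perm, r)  # cheap correction, reusing the factors
        x = [xi + di for xi, di in zip(x, d)]  # update accumulated in 64 bit
    return x, max_iter

# Small well-conditioned example with known solution x_true.
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [1.0, -2.0, 3.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x, iters = solve_mixed(A, b)
err = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print("iterations:", iters, "max error:", err)
```

The factorization is computed once and reused for every correction, which is why the approach can pay off on hardware where single precision is faster. When the matrix is too ill-conditioned for single precision, the loop in this sketch simply runs out of iterations, which corresponds to the case where the abstract recommends falling back to the double precision algorithm.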