Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations

Authors:
Dominik Göddeke; Robert Strzodka; Stefan Turek
Affiliations:
Universität Dortmund, Fachbereich Mathematik, Dortmund, Germany; Stanford University, Stanford, CA, USA; Universität Dortmund, Fachbereich Mathematik, Dortmund, Germany
Venue:
International Journal of Parallel, Emergent and Distributed Systems
Year:
2007

Abstract

In this survey paper, we compare native double precision solvers with emulated- and mixed-precision solvers for linear systems of equations as they typically arise in finite element discretisations. The emulation represents each value by two single-precision floating-point numbers to achieve higher precision, while the mixed precision iterative refinement computes residuals and updates the solution vector in double precision but solves the residual systems in single precision. Both techniques have been known since the 1960s, but little attention has been devoted to their performance aspects. Motivated by changing paradigms in processor technology and the emergence of highly parallel devices with outstanding single precision performance, we adapt the emulation and mixed precision techniques to coupled hardware configurations, where the parallel devices serve as scientific co-processors. The performance advantages are examined with respect to speedups over a native double precision implementation (time aspect) and reduced area requirements for a chip (space aspect). The paper begins with an overview of the theoretical background, algorithmic approaches and suitable hardware architectures. We then employ several conjugate gradient (CG) and multigrid solvers and study their behaviour for different parameter settings of the iterative refinement technique. Concrete speedup factors are evaluated on the coupled hardware configuration of a general-purpose CPU and a graphics processor. The complementary aspect of potential area savings is assessed on a field programmable gate array (FPGA). In the last part, we test the applicability of the proposed mixed precision schemes to ill-conditioned matrices. We conclude that the mixed precision approach works very well with the parallel co-processors, gaining speedup factors of four to five and area savings by factors of three to four, while maintaining the same accuracy as a reference solver executing everything in double precision.
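The mixed precision iterative refinement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the test matrix (the 1D Poisson matrix tridiag(-1, 2, -1)), the direct tridiagonal inner solver (the paper uses CG and multigrid), the stopping tolerance, and all function names are assumptions chosen for illustration. Single precision is imitated in pure Python by rounding every inner operation to a 32-bit float.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_tridiag_f32(n, rhs):
    """Thomas algorithm for tridiag(-1, 2, -1); every operation is rounded to single."""
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = f32(-1.0 / 2.0)
    d[0] = f32(f32(rhs[0]) / 2.0)
    for i in range(1, n):
        denom = f32(2.0 + c[i - 1])
        c[i] = f32(-1.0 / denom)
        d[i] = f32(f32(f32(rhs[i]) + d[i - 1]) / denom)
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = f32(d[i] - f32(c[i] * x[i + 1]))
    return x


def residual(n, b, x):
    """Residual b - A x for A = tridiag(-1, 2, -1), computed in double precision."""
    r = []
    for i in range(n):
        ax = 2.0 * x[i]
        if i > 0:
            ax -= x[i - 1]
        if i < n - 1:
            ax -= x[i + 1]
        r.append(b[i] - ax)
    return r


def mixed_precision_refine(n, b, tol=1e-10, max_outer=20):
    """Outer loop in double, inner solves in single; tol is an assumed setting."""
    x = [0.0] * n
    norm_b = max(abs(v) for v in b)
    for k in range(max_outer):
        r = residual(n, b, x)                       # double precision residual
        if max(abs(v) for v in r) <= tol * norm_b:
            return x, k                             # k corrections were applied
        corr = solve_tridiag_f32(n, r)              # single precision residual system
        x = [xi + ci for xi, ci in zip(x, corr)]    # double precision update
    return x, max_outer


def exact_solution(n):
    """Closed-form solution of tridiag(-1, 2, -1) x = (1, ..., 1)."""
    return [(i + 1) * (n - i) / 2.0 for i in range(n)]


if __name__ == "__main__":
    n = 64
    x, iters = mixed_precision_refine(n, [1.0] * n)
    exact = exact_solution(n)
    err = max(abs(u - v) for u, v in zip(x, exact)) / max(exact)
    print("outer iterations:", iters, "relative error:", err)
```

Only the residual and the solution update run in double precision; the expensive solve sees single precision data throughout, which is the part that would be offloaded to a co-processor in the setting the abstract describes.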