A Fused Multiply-Add (FMA) instruction is available in many general-purpose processors. It increases performance by reducing the latency of dependent operations, and it increases precision by computing the result as a single indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting effort to work as expected when compiled to use FMA, a cost that developers may not be willing to pay. As a result, many legacy applications are unable to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased FMA utilization by adding the option of obtaining the same numerical result as a separate FP multiply and FP add pair. In particular, we extend the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit is slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency, and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator shows that CMA improves SPECfp performance by 6.3% and reduces executed instructions by 4.7%.