A Fused Multiply-Add (FMA) instruction is available in many general-purpose processors. It increases performance by reducing the latency of dependent operations, and it increases precision by computing the result as a single indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting effort to work as expected when compiled to use FMA, a cost that developers may not be willing to pay. As a result, many legacy applications are unable to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased FMA utilization by adding the option of obtaining the same numerical result as a separate FP multiply and FP add pair. In particular, we extend the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit is slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency, and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator shows that CMA improves SPECfp performance by 6.3% and reduces executed instructions by 4.7%.