Abstract: An IEEE-compliant, 1 GHz Sparc64-V Floating-Point Unit (FPU) with reliability-accessibility-serviceability (RAS) features and partial support for denormal operands and results is presented. The FPU has two functional units, each with an adder (FADD) and a multiplier (FMUL). Additionally, one of the functional units contains a graphics unit (VIS). Two floating-point instructions can be scheduled out of order each cycle, one to each functional unit. A peak performance of 4 GFLOPS is achieved by scheduling two floating-point multiply-add (FMA) instructions each cycle. The FADD unit is fully pipelined and can execute an addition, subtraction, conversion, or compare instruction every cycle. The FMUL unit executes pipelined multiply instructions. Divide and square-root instructions are executed with multiple iterations through the multiplier pipeline. The VIS unit is also pipelined and executes SIMD fixed-point graphics instructions. The adder and multiplier have latencies of 3 and 4 cycles, respectively. Novel techniques are presented in the adder and multiplier implementations to reduce area and cycle time. The FPU provides RAS support for enhanced server reliability by using selective parity error detection. The FPU has been implemented in a 0.15 µm, 6-layer-metal CMOS technology.
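The peak-performance figure in the abstract follows from simple arithmetic, sketched below. The function name `peak_gflops` and its parameters are hypothetical, introduced only for illustration. The sketch assumes the usual convention that one FMA counts as two floating-point operations (a multiply and an add); the abstract does not state this convention explicitly.

```python
def peak_gflops(clock_ghz: float, fma_per_cycle: int, flops_per_fma: int = 2) -> float:
    """Peak throughput in GFLOPS for a unit issuing FMAs every cycle.

    Assumption: each FMA is counted as two floating-point operations.
    """
    return clock_ghz * fma_per_cycle * flops_per_fma


# Values taken from the abstract: 1 GHz clock, two FMA instructions per cycle
# (one per functional unit).
print(peak_gflops(clock_ghz=1.0, fma_per_cycle=2))  # 4.0
```

Under that assumption, 1 GHz × 2 FMAs per cycle × 2 operations per FMA gives the quoted peak of 4 GFLOPS. This is an upper bound that holds only when both functional units issue an FMA every cycle.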