A Floating-Point Unit for 4D Vector Inner Product with Reduced Latency

Authors: Donghyun Kim; Lee-Sup Kim
Affiliations: Qualcomm Inc., San Diego; Korea Advanced Institute of Science and Technology, Daejeon
Venue: IEEE Transactions on Computers
Year: 2009

Citing 0
Cited 5

A dual-shader 3-D graphics processor with fast 4-D vector inner product units and power-aware texture cache
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Self-Alignment Schemes for the Implementation of Addition-Related Floating-Point Operators
ACM Transactions on Reconfigurable Technology and Systems (TRETS)

A mobile 3-D display processor with a bandwidth-saving subdivider
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Homogeneous stream processors with embedded special function units for high-utilization programmable shaders
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Full Length Article: A high performance, area efficient TTA-like vertex shader architecture with optimized floating point arithmetic unit for embedded graphics applications
Microprocessors & Microsystems

Quantified Score

Hi-index	14.98

Abstract

This paper presents the algorithm and implementation of a new high-performance functional unit for the floating-point four-dimensional vector inner product (4D dot product; DP4), the operation most frequently performed in 3D graphics applications. The proposed IEEE-compliant DP4 unit computes Z = AB + CD + EF + GH in one path and keeps the intermediate rounding, using IEEE-754 round-to-nearest-even. To reduce latency, the proposed architecture merges the intermediate rounding with shift alignment and omits the intermediate carry-propagated addition and normalization. The unit is implemented in 0.18-µm CMOS technology and has a 12.8-ns critical path delay, 45.5 percent shorter than that of a previous DP4 implementation using discrete multipliers and adders. It also reduces the cycle time of 3D graphics applications by 12.4 percent on average compared to the usual 3D graphics FPU based on four-way multiply-add-fused units.
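
To make "keeps the intermediate rounding" concrete, the sketch below is a software model of a DP4 built from discrete multipliers and adders, where every product and every partial sum is rounded to nearest even before the next step. It is not the paper's hardware algorithm. Single precision (binary32), the pairwise summation order (AB + CD) + (EF + GH), and the function names round_f32, dp4_rounded, and dp4_exact are all assumptions made for illustration; the abstract specifies none of them.

```python
import struct
from fractions import Fraction


def round_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (ties to even)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dp4_rounded(a, b, c, d, e, f, g, h):
    """Z = AB + CD + EF + GH with rounding after every multiply and add.

    Assumes all inputs are exactly representable in binary32.
    The summation order (pairwise tree) is an assumption, not from the paper.
    """
    # A product of two binary32 values is exact in binary64, so a single
    # rounding step yields the correctly rounded binary32 product.
    p0 = round_f32(a * b)
    p1 = round_f32(c * d)
    p2 = round_f32(e * f)
    p3 = round_f32(g * h)
    # Intermediate sums are rounded as well, as discrete adders would do.
    s0 = round_f32(p0 + p1)
    s1 = round_f32(p2 + p3)
    return round_f32(s0 + s1)


def dp4_exact(a, b, c, d, e, f, g, h):
    """The same expression in exact rational arithmetic, for comparison."""
    fa, fb, fc, fd, fe, ff, fg, fh = map(Fraction, (a, b, c, d, e, f, g, h))
    return fa * fb + fc * fd + fe * ff + fg * fh


if __name__ == "__main__":
    # (1, 2, 3, 4) . (5, 6, 7, 8): small integers, so no rounding error occurs.
    print(dp4_rounded(1.0, 5.0, 2.0, 6.0, 3.0, 7.0, 4.0, 8.0))

    # A case where the intermediate rounding changes the result.
    eps = 2.0 ** -23          # spacing of binary32 values just above 1.0
    a = 1.0 + eps
    d = 1.0 + 2.0 * eps
    print(dp4_rounded(a, a, -1.0, d, 0.0, 0.0, 0.0, 0.0))
    print(dp4_exact(a, a, -1.0, d, 0.0, 0.0, 0.0, 0.0))
```

In the second case the exact result is 2^-46, but the rounded model returns 0.0 because the low-order bit of the product AB is discarded before the addition. A unit that keeps the intermediate rounding would, under the assumptions above, be expected to match the rounded model and not the exact value.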