Custom floating-point unit generation for embedded systems

Authors:
Yee Jern Chong;Sri Parameswaran
Affiliations:
School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia;School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia
Venue:
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Year:
2009

Abstract

While application-specific instruction-set processors (ASIPs) have allowed designers to create processors with custom instructions to target specific applications, floating-point (FP) units (FPUs) are still instantiated as noncustomizable general-purpose units, which, if underutilized, waste area and performance. Therefore, there is a need for custom FPUs for embedded systems. To create a custom FPU, the subset of FP instructions that should be implemented in hardware has to be determined. Implementing more instructions in hardware reduces the cycle count of the application but may lead to increased latency if the critical delay of the FPU increases. Therefore, the balance between hardware-implemented and software-emulated instructions that produces the best performance must be found. To find this balance, a rapid design space exploration was performed to explore the tradeoffs between area and performance. To reduce the area of the custom FPU, it is desirable to merge the datapaths for each of the FP operations so that redundant hardware is minimized. However, FP datapaths are complex and contain components with varying bit widths; hence, sharing components of different bit widths is necessary. This introduces the problem of bit alignment, which involves determining how smaller resources should be aligned within larger resources when merged. A novel algorithm for solving the bit-alignment problem during datapath merging was developed. Our results show that adding more FP hardware does not necessarily lower runtime: the delays associated with the additional hardware can outweigh the cycle-count reductions. We found that, with the MediaBench applications, datapath merging with bit alignment reduced area by an average of 22.5%, compared with an average of 14.1% without bit alignment.
With the Standard Performance Evaluation Corporation (SPEC) CPU2000 FP (CFP2000) applications, datapath merging with bit alignment reduced area by an average of 7.6%, compared with an average of 3.9% without bit alignment. The improvement is less pronounced with the SPEC CFP2000 benchmarks because these applications predominantly use double-precision operations only. They therefore contain fewer resources of differing bit widths and benefit less from bit alignment.
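
The tradeoff between cycle count and critical delay described in the abstract can be illustrated with a minimal sketch. It assumes that runtime is simply the cycle count multiplied by the clock period set by the critical delay. The configuration names, cycle counts, and delays below are hypothetical values chosen for illustration; they are not taken from the paper, and `FpuConfig` and `fastest` are illustrative names rather than part of the authors' tool flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpuConfig:
    """One candidate split between hardware and software FP instructions."""
    name: str
    cycles: int               # application cycle count (hypothetical)
    critical_delay_ns: float  # clock period imposed by the critical path (hypothetical)

    @property
    def runtime_ms(self) -> float:
        # runtime = cycle count x clock period, converted from ns to ms
        return self.cycles * self.critical_delay_ns / 1e6


def fastest(configs):
    """Return the configuration with the lowest estimated runtime."""
    return min(configs, key=lambda c: c.runtime_ms)


# Hypothetical design points: more hardware lowers the cycle count
# but lengthens the critical path.
configs = [
    FpuConfig("software emulation only", 90_000_000, 4.0),
    FpuConfig("add/mul in hardware", 40_000_000, 5.0),
    FpuConfig("add/mul/div/sqrt in hardware", 36_000_000, 6.0),
]

best = fastest(configs)
for c in configs:
    print(f"{c.name}: {c.runtime_ms:.1f} ms")
print("fastest:", best.name)
```

With these made-up numbers, the configuration with the most hardware has the lowest cycle count but not the lowest runtime, which is the kind of outcome a design space exploration would need to detect.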