When adding reconfigurability to custom hardware, one must ensure that the slowdown introduced by the reconfigurable logic does not cancel out the gains obtained from reconfiguration. These gains are greatest in very specific, computation-intensive applications, and diminish as applications become more general and heterogeneous. For superscalar processors, this argues for limiting reconfigurability to precise changes in existing functional units rather than adding a fully configurable functional unit. We present a detailed study of the modifications needed in a superscalar processor to allow an FPU to be dynamically reconfigured as several ALUs with a minimal increase in the latency of these functional units. We describe the timing of the FPU's multiplier tree and the logic that decides when to reconfigure. Because the reconfiguration involves more than one simple unit, the decision is more global than a cycle-by-cycle choice and must hold for a longer period of time. We discuss possible policies for making these dynamic reconfiguration decisions. The results show gains of up to 56% in the best cases, and of 10% on average, on typical architectures over a wide range of applications.
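
To make the idea of a decision that holds for a longer period concrete, the following is a minimal sketch of one possible interval-based policy. It is not the policy evaluated in the paper: the class name `IntervalPolicy`, the use of the floating-point fraction of decoded instructions as the signal, and both thresholds are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass

FPU_MODE = "fpu"    # unit operates as a floating-point unit
ALU_MODE = "alus"   # unit is reconfigured as several integer ALUs


@dataclass
class IntervalPolicy:
    """Hypothetical policy: one decision per observation window.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    fp_low: float = 0.02    # switch to ALUs when the FP fraction falls below this
    fp_high: float = 0.10   # switch back to the FPU when it rises above this
    mode: str = FPU_MODE

    def decide(self, fp_ops: int, int_ops: int) -> str:
        """Return the mode for the next window, given counts from the last one."""
        total = fp_ops + int_ops
        if total == 0:
            return self.mode            # nothing observed: keep the current mode
        fp_fraction = fp_ops / total
        # Two separate thresholds give hysteresis, so the unit does not
        # flip back and forth when the FP fraction hovers near one value.
        if self.mode == FPU_MODE and fp_fraction < self.fp_low:
            self.mode = ALU_MODE
        elif self.mode == ALU_MODE and fp_fraction > self.fp_high:
            self.mode = FPU_MODE
        return self.mode


def simulate(windows):
    """Apply the policy to a list of (fp_ops, int_ops) pairs, one per window."""
    policy = IntervalPolicy()
    return [policy.decide(fp, integer) for fp, integer in windows]
```

Calling `simulate([(0, 1000), (50, 950), (200, 800)])` yields one mode per window. The gap between the two thresholds is the design choice of interest here: it keeps a decision stable across several windows instead of revisiting it every cycle.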