Hierarchical registers for scientific computers
ICS '88 Proceedings of the 2nd international conference on Supercomputing
The performance impact of incomplete bypassing in processor pipelines
Proceedings of the 28th annual international symposium on Microarchitecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
Wattch: a framework for architectural-level power analysis and optimizations
Proceedings of the 27th annual international symposium on Computer architecture
Two-level hierarchical register file organization for VLIW processors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Focusing processor policies via critical-path prediction
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
The Alpha 21264 Microprocessor
IEEE Micro
Using SimPoint for accurate and efficient simulation
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Banked multiported register files for high-frequency superscalar microprocessors
Proceedings of the 30th annual international symposium on Computer architecture
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Instruction Replication: Reducing Delays Due to Inter-PE Communication Latency
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
A case study: power and performance improvement of a chip multiprocessor for transaction processing
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hi-index | 0.00 |
Researchers have proposed clustered microarchitectures for performance and energy effciency. Typically, clustered microarchitectures offer fast, local bypassing between instructions within clusters but global bypasses are slower. Traditional clustered microarchitectures (TCM) are implemented by partitioning the register file and associated functional units to clusters. This paper demonstrates an alternate implementation - Incomplete bypass-based clustered microarchitecture (IBCM). IBCM reduces the length of bypass wires by 42.4% resulting in an 8.9% reduction of "Execute" stage delay. This delay reduction in the critical EX stage enables voltage scaling that results in significantly lower average power consumption (between 11.7% and 19.5% lower) while achieving identical performance.