Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper we show that the optimal pipeline for superscalar becomes shallower with technology, when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that superscalar does not have sufficient parallelism to hide the relatively increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least until the near-future 50 nm technology. While this result ignores bandwidth constraints, using SMT to tolerate latency due to wire delays is not that simple because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
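
The split between clock speed and IPC mentioned above can be checked with a little arithmetic. The sketch below assumes that throughput is the product of clock frequency and IPC for a fixed instruction count; that decomposition, and the function names, are illustrative assumptions rather than anything taken from the paper. Under that assumption, a 43% clock increase leaves an IPC gain of roughly 1.4x to reach a speedup of two per generation.

```python
def required_ipc_gain(target_speedup: float, clock_gain: float) -> float:
    """Return the IPC multiplier needed so that clock_gain * ipc_gain
    equals target_speedup.

    Assumption (not from the paper): performance = clock frequency * IPC.
    """
    if clock_gain <= 0:
        raise ValueError("clock_gain must be positive")
    return target_speedup / clock_gain


def cumulative_speedup(per_generation: float, generations: int) -> float:
    """Compound a per-generation speedup over several technology generations."""
    return per_generation ** generations


if __name__ == "__main__":
    # 43% faster clock per generation, target speedup of two per generation.
    ipc_gain = required_ipc_gain(target_speedup=2.0, clock_gain=1.43)
    print(f"IPC gain needed per generation: {ipc_gain:.2f}x")
    print(f"Speedup after 3 generations: {cumulative_speedup(2.0, 3):.0f}x")
```

Because the abstract states "at least 43%", the IPC figure computed here is an upper bound on what IPC must contribute: a larger clock gain would leave less for IPC to cover.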