Wire Delay is Not a Problem for SMT (In the Near Future)

Authors:
T. N. Vijaykumar; Zeshan Chishti
Affiliations:
Purdue University; Purdue University
Venue:
Proceedings of the 31st annual international symposium on Computer architecture
Year:
2004

Abstract

Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper we show that the optimal pipeline for superscalar becomes shallower with technology, when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that superscalar does not have sufficient parallelism to hide the relatively-increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least till the near-future 50nm technology. While this result ignores bandwidth constraints, using SMT to tolerate latency due to wire delays is not that simple because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
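The arithmetic behind "the remainder of the expected speedup of two per generation through IPC" can be made explicit with a small sketch. It assumes throughput is the product of clock frequency and IPC, so the per-generation factors multiply. That decomposition and the helper names below are illustrative assumptions, not taken from the paper; only the 43% and 2x figures come from the abstract.

```python
# Figures quoted in the abstract.
TARGET_SPEEDUP_PER_GEN = 2.0  # expected speedup per technology generation
CLOCK_GAIN_PER_GEN = 1.43     # "at least 43% increase in clock speed"


def required_ipc_gain(target_speedup: float, clock_gain: float) -> float:
    """IPC factor needed so that clock_gain * ipc_gain == target_speedup.

    Assumes throughput = clock frequency * IPC (an illustrative model).
    """
    return target_speedup / clock_gain


def cumulative(factor: float, generations: int) -> float:
    """Compound a per-generation factor over several generations."""
    return factor ** generations


ipc_gain = required_ipc_gain(TARGET_SPEEDUP_PER_GEN, CLOCK_GAIN_PER_GEN)
print(f"required IPC gain per generation: {ipc_gain:.2f}x")

# Compound the clock, IPC, and total factors over a few generations.
for gen in range(1, 4):
    clock = cumulative(CLOCK_GAIN_PER_GEN, gen)
    ipc = cumulative(ipc_gain, gen)
    print(f"gen {gen}: clock {clock:.2f}x, IPC {ipc:.2f}x, total {clock * ipc:.2f}x")
```

Under this model the IPC contribution works out to roughly 1.4x per generation, about the same size as the clock contribution.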