One Billion Transistors, One Uniprocessor, One Chip

Authors:
Yale N. Patt;Sanjay J. Patel;Marius Evers;Daniel H. Friendly;Jared Stark
Venue:
Computer
Year:
1997


Abstract

Researchers from the University of Michigan conclude that billion-transistor processors will look much like today's processors, only bigger, faster, and wider (issuing more instructions at once). The authors describe the key problems (instruction supply, data memory supply, and an implementable execution core) that prevent current superscalars from scaling up to 16 or 32 instructions per issue. To improve the instruction supply, they propose out-of-order fetching, Multi-Hybrid branch predictors, and trace caches. To enhance the data supply, they predict that replicated first-level caches, huge on-chip caches, and data value speculation will be used. To provide a high-speed, implementable execution core capable of sustaining the necessary instruction throughput, they advocate a large out-of-order-issue instruction window (2,000 instructions), clustered (separated) banks of functional units, and hierarchical scheduling of ready instructions. They contend that the current uniprocessor model can provide sufficient performance and use a billion transistors effectively without changing the programming model or discarding software compatibility.
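
The abstract names hybrid branch prediction as one way to improve instruction supply. The sketch below is not the Multi-Hybrid predictor from the article; it is a generic two-component tournament scheme, shown only to illustrate the idea of letting a selector choose, per branch, whichever component predictor has been more accurate. The class names, the choice of bimodal and gshare components, and the table size are all assumptions made for illustration.

```python
class TwoBitCounterTable:
    """Table of 2-bit saturating counters, initialised to 1 (weakly 'no')."""

    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)

    def predict(self, index):
        # Counter values 2 and 3 mean 'yes'; 0 and 1 mean 'no'.
        return self.counters[index & self.mask] >= 2

    def update(self, index, outcome):
        i = index & self.mask
        if outcome:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class HybridPredictor:
    """Hypothetical tournament predictor: bimodal + gshare + selector."""

    def __init__(self, index_bits=10):
        self.history_mask = (1 << index_bits) - 1
        self.bimodal = TwoBitCounterTable(index_bits)   # indexed by PC only
        self.gshare = TwoBitCounterTable(index_bits)    # indexed by PC xor history
        self.selector = TwoBitCounterTable(index_bits)  # 'yes' means trust gshare
        self.history = 0                                # global branch history

    def predict(self, pc):
        if self.selector.predict(pc):
            return self.gshare.predict(pc ^ self.history)
        return self.bimodal.predict(pc)

    def update(self, pc, taken):
        gshare_index = pc ^ self.history
        bimodal_ok = self.bimodal.predict(pc) == taken
        gshare_ok = self.gshare.predict(gshare_index) == taken
        # Train the selector only when the two components disagree.
        if bimodal_ok != gshare_ok:
            self.selector.update(pc, gshare_ok)
        self.bimodal.update(pc, taken)
        self.gshare.update(gshare_index, taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def accuracy(predictor, trace):
    """Fraction of (pc, taken) pairs predicted correctly, training as we go."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


# A branch that alternates taken / not taken defeats a per-branch counter,
# but history-based prediction can learn it; the selector picks accordingly.
alternating = [(0x40, i % 2 == 0) for i in range(2000)]
always_taken = [(0x80, True) for _ in range(2000)]
alternating_accuracy = accuracy(HybridPredictor(), alternating)
always_taken_accuracy = accuracy(HybridPredictor(), always_taken)
```

On the alternating trace the selector drifts toward the history-based component, while on the always-taken trace the simple per-branch counter is sufficient. A real design would combine more components and use far larger tables than this toy example.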