Super-Scalar Processor Design

Authors:
William M. Johnson
Affiliations:
-
Venue:
Super-Scalar Processor Design
Year:
1989

Citing 0
Cited 8

Architecture and implementation of a VLIW supercomputer

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Single instruction stream parallelism is greater than two

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
DSNS (dynamically-hazard-resolved statically-code-scheduled, nonuniform superscalar): yet another superscalar processor architecture

ACM SIGARCH Computer Architecture News
Strategies for branch target buffers

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
A comparative performance evaluation of various state maintenance mechanisms

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Boosting beyond static scheduling in a superscalar processor

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Design and Implementation Trade-Offs in the Clipper C400 Architecture

IEEE Micro
Efficient Exploitation of Instruction-Level Parallelism for Superscalar Processors by the Conjugate Register File Scheme

IEEE Transactions on Computers

Quantified Score

Hi-index	0.00

Visualization

Abstract

A super-scalar processor is one that is capable of sustaining an instruction-execution rate of more than one instruction per clock cycle. Maintaining this execution rate is primarily a problem of scheduling processor resources (such as functional units) for high utilization. A number of scheduling algorithms have been published, with wide-ranging claims of performance over the single-instruction issue of a scalar processor. However, a number of these claims are based on idealizations or on special-purpose applications. This study uses trace-driven simulation to evaluate many different super-scalar hardware organizations. Super-scalar performance is limited primarily by instruction-fetch inefficiencies caused by both branch delays and instruction misalignment. Because of this instruction-fetch limitation, it is not worthwhile to explore highly-concurrent execution hardware. Rather, it is more appropriate to explore economical execution hardware that more closely matches the instruction throughput provided by the instruction fetcher. This study examines techniques for reducing the instruction-fetch inefficiencies and explores the resulting hardware organizations. This study concludes that a super-scalar processor can have nearly twice the performance of a scalar processor, but that this requires that four major hardware features: out-of-order execution, register renaming, branch prediction, and a four-instruction decoder. These features are interdependent, and removing any single feature reduces average performance by 18% or more. However, there are many hardware simplifications that cause only a small performance reduction.