IEEE Transactions on Computers
A super-scalar processor is one that can sustain an execution rate of more than one instruction per clock cycle. Maintaining this execution rate is primarily a problem of scheduling processor resources (such as functional units) for high utilization. A number of scheduling algorithms have been published, with wide-ranging claims of performance over the single-instruction issue of a scalar processor. However, many of these claims are based on idealizations or on special-purpose applications. This study uses trace-driven simulation to evaluate many different super-scalar hardware organizations. Super-scalar performance is limited primarily by instruction-fetch inefficiencies caused by both branch delays and instruction misalignment. Because of this instruction-fetch limitation, it is not worthwhile to explore highly concurrent execution hardware. Rather, it is more appropriate to explore economical execution hardware that more closely matches the instruction throughput provided by the instruction fetcher. This study examines techniques for reducing the instruction-fetch inefficiencies and explores the resulting hardware organizations. This study concludes that a super-scalar processor can have nearly twice the performance of a scalar processor, but that this requires four major hardware features: out-of-order execution, register renaming, branch prediction, and a four-instruction decoder. These features are interdependent, and removing any single feature reduces average performance by 18% or more. However, there are many hardware simplifications that cause only a small performance reduction.
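The effect of instruction misalignment on fetch throughput can be illustrated with a toy model. The sketch below is not the study's simulator: it assumes a hypothetical fetcher that reads one aligned block of `width` instructions per cycle, ignores branch delays entirely, and represents a program as sequential runs given as `(start, length)` pairs in instruction units. The function names and all values are illustrative only.

```python
def fetch_cycles(start, length, width=4):
    """Cycles an aligned fetcher needs to read one sequential run.

    start  -- instruction index of the first instruction in the run
    length -- number of instructions in the run
    width  -- instructions per aligned fetch block (assumed: 4)
    """
    if length <= 0:
        return 0
    first_block = start // width
    last_block = (start + length - 1) // width
    # One cycle per aligned block that the run touches.
    return last_block - first_block + 1


def instructions_per_fetch_cycle(runs, width=4):
    """Average useful instructions delivered per fetch cycle."""
    total_instructions = sum(length for _, length in runs)
    total_cycles = sum(fetch_cycles(s, n, width) for s, n in runs)
    return total_instructions / total_cycles if total_cycles else 0.0


# A four-instruction run that starts on a block boundary takes one cycle;
# the same run starting at offset 3 straddles two blocks and takes two.
aligned = fetch_cycles(0, 4)
misaligned = fetch_cycles(3, 4)
```

In this model a misaligned branch target halves the fetch rate for a four-instruction run even though the decoder could accept four instructions per cycle, which is the kind of mismatch between fetch and execution bandwidth the abstract describes.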