Optimal pipelining in supercomputers

Authors:
S. R. Kunkel;J. E. Smith
Affiliations:
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, Wisconsin;Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, Wisconsin
Venue:
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Year:
1986

References cited (2):
Structure of Computers and Computations
Design of a Computer—The Control Data 6600

Cited by (19):
Fast temporary storage for serial and parallel execution. ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Characterization of branch and data dependencies on programs for evaluating pipeline performance. IEEE Transactions on Computers
The performance potential of multiple functional unit processors. ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Latch-to-Latch Timing Rules. IEEE Transactions on Computers
MOVE: a framework for high-performance processor design. Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Clocked and asynchronous instruction pipelines. MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Clock rate versus IPC: the end of the road for conventional microarchitectures. Proceedings of the 27th annual international symposium on Computer architecture
The optimum pipeline depth for a microprocessor. ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Architectural differences of efficient sequential and parallel computers. Journal of Systems Architecture: the EUROMICRO Journal
Optimizing pipelines for power and performance. Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Optimum Power/Performance Pipeline Depth. Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A First-Order Superscalar Processor Model. Proceedings of the 31st annual international symposium on Computer architecture
Power-optimal pipelining in deep submicron technology. Proceedings of the 2004 international symposium on Low power electronics and design
The optimum pipeline depth considering both power and performance. ACM Transactions on Architecture and Code Optimization (TACO)
Optimal Power/Performance Pipeline Depth for SMT in Scaled Technologies. IEEE Transactions on Computers
A mechanistic performance model for superscalar out-of-order processors. ACM Transactions on Computer Systems (TOCS)
Applied inference: Case studies in microarchitectural design. ACM Transactions on Architecture and Code Optimization (TACO)
Performance analysis of multi-threaded multi-core CPUs. Proceedings of the First International Workshop on Many-core Embedded Systems

Abstract

This paper examines the relationship between the degree of central processor pipelining and performance, in the context of modern supercomputers. Limitations due to instruction dependencies are studied via simulations of the CRAY-1S, using both scalar and vector code. The simulations show that instruction dependencies severely limit performance for scalar code as well as overall performance.

The effects of latch overhead are then considered. The primary cause of latch overhead is the difference between maximum and minimum gate propagation delays. This difference causes both the skewing of data as it passes along the data path and unintentional clock skewing due to clock fanout logic. Latch overhead is studied analytically in order to derive a lower bound on the clock period that may be used in a pipelined system. The analysis also touches on other points related to latch clocking. It shows that, for short pipeline segments, the Earle latch and the polarity hold latch give the same clock period bound for both single-phase and multi-phase clocks.

Overheads due to data skew and unintentional clock skew are each added to the CRAY-1S simulation model. Simulation results with realistic assumptions show that eight to ten gate levels per pipeline segment lead to optimal overall performance. The results also show that, for short pipeline segments, data skew and clock skew contribute about equally to the degradation in performance.