This paper examines the relationship between the degree of central processor pipelining and performance in the context of modern supercomputers. Limitations due to instruction dependencies are studied via simulations of the CRAY-1S, for both scalar and vector code. The simulations show that instruction dependencies severely limit performance for scalar code as well as overall performance.

The effects of latch overhead are then considered. The primary cause of latch overhead is the difference between maximum and minimum gate propagation delays, which causes both the skewing of data as it passes along the data path and unintentional clock skew in the clock fanout logic. Latch overhead is studied analytically in order to establish a lower bound on the clock period that may be used in a pipelined system; the analysis also touches on other points related to latch clocking. It shows that, for short pipeline segments, the Earle latch and the polarity hold latch give the same clock period bound for both single-phase and multi-phase clocks. Overheads due to data skew and unintentional clock skew are then each added to the CRAY-1S simulation model. Simulation results with realistic assumptions show that eight to ten gate levels per pipeline segment lead to optimal overall performance. The results also show that, for short pipeline segments, data skew and clock skew contribute about equally to the degradation in performance.
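The trade-off described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the paper's CRAY-1S simulation or its latch analysis. It assumes a fixed amount of logic per operation (TOTAL_GATE_LEVELS), a fixed per-segment overhead standing in for latch delay plus data and clock skew (OVERHEAD_GATE_DELAYS), and a fraction of instructions that must wait for the full pipeline latency of a predecessor (DEPENDENT_FRACTION). All names and numeric values are hypothetical and were picked for illustration only, so the optimum the model prints should not be read as confirming the paper's result.

```python
import math

# Hypothetical, illustrative parameters (not taken from the paper).
TOTAL_GATE_LEVELS = 64      # logic depth of one operation, in gate levels
OVERHEAD_GATE_DELAYS = 4    # per-segment latch + skew overhead, in gate delays
DEPENDENT_FRACTION = 0.2    # share of instructions that wait for a full result


def time_per_instruction(gates_per_segment: int) -> float:
    """Average time per instruction, in gate delays, for a given segment length."""
    # Shorter segments mean more pipeline stages for the same total logic.
    stages = math.ceil(TOTAL_GATE_LEVELS / gates_per_segment)
    # Every segment pays the fixed overhead on top of its useful logic.
    clock_period = gates_per_segment + OVERHEAD_GATE_DELAYS
    # Independent instructions issue every cycle; dependent ones stall
    # for the remaining (stages - 1) cycles of the producer.
    cycles = 1 + DEPENDENT_FRACTION * (stages - 1)
    return clock_period * cycles


def best_segment_length() -> int:
    """Segment length (gate levels) that minimizes time per instruction."""
    candidates = range(1, TOTAL_GATE_LEVELS + 1)
    return min(candidates, key=time_per_instruction)


if __name__ == "__main__":
    for g in (2, 4, 8, 16, 32, 64):
        print(f"{g:2d} gate levels/segment: {time_per_instruction(g):6.1f} gate delays/instruction")
    print("toy-model optimum:", best_segment_length(), "gate levels per segment")
```

In this toy model, very short segments lose time to per-segment overhead and dependency stalls, while very long segments lose time to a slow clock, so the cost curve has an interior minimum. Changing any of the three assumed parameters moves that minimum.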