Disk cache—miss ratio analysis and design considerations
ACM Transactions on Computer Systems (TOCS)
Clocking Schemes for High-Speed Digital Systems
IEEE Transactions on Computers
Optimal pipelining in supercomputers
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Sequentiality and prefetching in database systems
ACM Transactions on Database Systems (TODS)
ACM Computing Surveys (CSUR)
ACM Computing Surveys (CSUR)
Long term file migration: development and evaluation of algorithms
Communications of the ACM
Computer system design using a hierarchical approach to performance evaluation
Communications of the ACM
Performance evaluation of highly concurrent computers by deterministic simulation
Communications of the ACM
The Architecture of Symbolic Computers
The Architecture of Symbolic Computers
Computer Architecture and Parallel Processing
Computer Architecture and Parallel Processing
Measurement and analysis of instruction use in the VAX-11/780
ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
A study of branch prediction strategies
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
A modeling approach and design tool for pipelined central processors
ISCA '79 Proceedings of the 6th annual symposium on Computer architecture
An instruction timing model of CPU performance
ISCA '77 Proceedings of the 4th annual symposium on Computer architecture
Exploring a Stack Architecture
Computer
Dynamic Characteristics of Loops
IEEE Transactions on Computers
Dynamic Profile of Instruction Sequences for the IBM System/370
IEEE Transactions on Computers
Two-Level Replacement Decisions in Paging Stores
IEEE Transactions on Computers
A study of replacement algorithms for a virtual-storage computer
IBM Systems Journal
Two-level adaptive training branch prediction
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Alternative implementations of two-level adaptive branch prediction
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
A comprehensive instruction fetch mechanism for a processor supporting speculative execution
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Understanding some simple processor-performance limits
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Alternative implementations of two-level adaptive branch prediction
25 years of the international symposia on Computer architecture (selected papers)
The optimum pipeline depth for a microprocessor
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Architectural differences of efficient sequential and parallel computers
Journal of Systems Architecture: the EUROMICRO Journal
Optimizing pipelines for power and performance
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Optimum Power/Performance Pipeline Depth
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A First-Order Superscalar Processor Model
Proceedings of the 31st annual international symposium on Computer architecture
The optimum pipeline depth considering both power and performance
ACM Transactions on Architecture and Code Optimization (TACO)
A mechanistic performance model for superscalar out-of-order processors
ACM Transactions on Computer Systems (TOCS)
Micro-architecture performance estimation by formula
SAMOS'05 Proceedings of the 5th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation
Predicting Performance Impact of DVFS for Realistic Memory Systems
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 14.98 |
The nature by which branches and data dependencies generate delays that degrade pipeline performance is investigated in this paper. We show that for the general execution trace, few specific delays can be considered in isolation; rather, the magnitude of any specific delay may depend on the relative proximity of other delays. This phenomenon can make the task of accurately characterizing a trace tape with simple statistics intractable. We present a set of trace reductions that facilitates this task by simplifying the corresponding data-dependency graph. The reductions operate on multiple data-dependency arcs and branches in conjunction; those arcs whose performance implications are redundant with respect to the dependency graph are identified, and eliminated from the graph. We show that the reduced graph can be accurately characterized by simple statistics. We use these statistics to show that as the length of a pipeline increases, the performance degradation due to data dependencies and branches increases monotonically. However, lengthening the pipeline may correspond to decreasing the cycle time of the pipeline. These two opposing effects are used in conjunction to derive an equation for optimal pipeline length for a given trace tape. The optimal pipeline length is shown to be characterized by n = √γα where γ is the ratio of overall circuit delay to latching overhead, and a is a function of the trace statistics that accounts for the delays induced by data dependencies and branches.