ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
How many addressing modes are enough?
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
Comparing software and hardware schemes for reducing the cost of branches
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
Cache and memory hierarchy design: a performance-directed approach
Cache and memory hierarchy design: a performance-directed approach
The Design of a Microsupercomputer
Computer - Special issue on experimental research in computer architecture
Reducing the branch penalty by rearranging instructions in a double-width memory
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
MIPS RISC architectures
Technology-organization tradeoffs in the architecture of a high performance processor
Technology-organization tradeoffs in the architecture of a high performance processor
A study of branch prediction strategies
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Aspects of cache memory and instruction buffer performance
Aspects of cache memory and instruction buffer performance
Tradeoffs in two-level on-chip caching
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A comparison of two pipeline organizations
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Resource allocation in a high clock rate microprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Instruction fetching: coping with code bloat
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Direct-mapped versus set-associative pipelined caches
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Increasing cache port efficiency for dynamic superscalar microprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Improving cache performance with balanced tag and data paths
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Strategic directions in computer architecture
ACM Computing Surveys (CSUR) - Special ACM 50th-anniversary issue: strategic directions in computing research
Designing high bandwidth on-chip caches
Proceedings of the 24th annual international symposium on Computer architecture
Multilevel Optimization of Pipelined Caches
IEEE Transactions on Computers
Low load latency through sum-addressed memory (SAM)
Proceedings of the 25th annual international symposium on Computer architecture
High Bandwidth On-Chip Cache Design
IEEE Transactions on Computers
A One's Complement Cache Memory
ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Hi-index | 0.01 |
The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processors speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to reduce the dependence of CPU cycle time on cache access time. This paper studies the performance advantages of a pipelined cache for a GaAs implementation of the MIPS based architecture using a design methodology that includes long traces of multiprogrammed applications and detailed timing analysis. The study evaluates instruction and data caches with various pipeline depths, cache sizes, block sizes, and refill penalties. The impact on CPU cycle time of these alternatives is also factored into the evaluation. Hardware-based and software-based strategies are considered for hiding the branch and load delays which may be required to avoid pipeline hazards. The results show that software-based methods for mitigating the penalty of branch delays can be as successful as the hardware-based branch-target buffer approach, despite the code-expansion inherent in the software methods. The situation is similar for load delays; while hardware-based dynamic methods hide more delay cycles than do static approaches, they may give up the advantage by extending the cycle time. Because these methods are quite successful at hiding small numbers of branch and load delays, and because processors with pipelined caches also have shorter CPU cycle times and larger caches, a significant performance advantage is gained by using two to three pipeline stages to fetch data from the cache.