Performance optimization of pipelined primary caches

Authors:
Kunle Olukotun;Trevor Mudge;Richard Brown
Venue:
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Year:
1992

Abstract

The CPU cycle time of a high-performance processor is usually determined by the access time of the primary cache. As processor speeds increase, designers will have to increase the number of pipeline stages used to fetch data from the cache in order to reduce the dependence of CPU cycle time on cache access time. This paper studies the performance advantages of a pipelined cache for a GaAs implementation of the MIPS-based architecture using a design methodology that includes long traces of multiprogrammed applications and detailed timing analysis. The study evaluates instruction and data caches with various pipeline depths, cache sizes, block sizes, and refill penalties. The impact of these alternatives on CPU cycle time is also factored into the evaluation. Hardware-based and software-based strategies are considered for hiding the branch and load delays that may be required to avoid pipeline hazards. The results show that software-based methods for mitigating the penalty of branch delays can be as successful as the hardware-based branch-target buffer approach, despite the code expansion inherent in the software methods. The situation is similar for load delays: while hardware-based dynamic methods hide more delay cycles than static approaches do, they may give up that advantage by extending the cycle time. Because these methods are quite successful at hiding small numbers of branch and load delays, and because processors with pipelined caches also have shorter CPU cycle times and larger caches, a significant performance advantage is gained by using two to three pipeline stages to fetch data from the cache.
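The tradeoff described in the abstract can be illustrated with a first-order model in which time per instruction is CPI multiplied by cycle time, and CPI grows with any branch delays, load delays, and cache refill stalls that are not hidden. The sketch below is an assumption-laden illustration, not the paper's methodology: the function name, the parameter names, and every number are hypothetical and are not taken from the paper's traces or timing analysis.

```python
def time_per_instruction(cycle_time_ns, branch_freq, load_freq,
                         unhidden_branch_delay, unhidden_load_delay,
                         misses_per_instr, refill_penalty_cycles,
                         base_cpi=1.0):
    """Average time per instruction in ns, modeled as CPI * cycle time."""
    cpi = (base_cpi
           + branch_freq * unhidden_branch_delay        # branch delay cycles left exposed
           + load_freq * unhidden_load_delay            # load delay cycles left exposed
           + misses_per_instr * refill_penalty_cycles)  # cache refill stalls
    return cpi * cycle_time_ns


# Hypothetical unpipelined cache: longer cycle, no extra delay slots,
# smaller cache (more misses per instruction).
shallow = time_per_instruction(
    cycle_time_ns=5.0, branch_freq=0.15, load_freq=0.25,
    unhidden_branch_delay=0.0, unhidden_load_delay=0.0,
    misses_per_instr=0.05, refill_penalty_cycles=10)

# Hypothetical pipelined cache: shorter cycle and a larger cache, but some
# branch and load delay cycles remain exposed, and the refill penalty costs
# more of the (shorter) cycles.
deep = time_per_instruction(
    cycle_time_ns=3.5, branch_freq=0.15, load_freq=0.25,
    unhidden_branch_delay=0.5, unhidden_load_delay=0.4,
    misses_per_instr=0.03, refill_penalty_cycles=14)

print(f"unpipelined cache: {shallow:.4f} ns per instruction")
print(f"pipelined cache:   {deep:.4f} ns per instruction")
```

With these made-up inputs the pipelined configuration comes out ahead because the shorter cycle time outweighs the added delay cycles. Different assumed frequencies or delays could reverse the result, which is why the abstract stresses evaluating cycle time and delay hiding together.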