Cost-Sensitive Cache Replacement Algorithms
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Classic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is no longer as straightforward as it once was. Ultimately, the true cost measure of a miss is its penalty, i.e., the actual processing bandwidth lost because of the miss. It is known that, contrary to load misses, the penalty of store misses is mostly hidden in modern processors. To take advantage of this observation, we propose simple schemes that trade load misses for store misses. We extend classic replacement algorithms such as LRU (Least Recently Used) and PLRU (Pseudo-LRU) to reduce the aggregate miss penalty instead of the miss count. One key issue is predicting the next access type to a block, so that higher replacement priority is given to blocks whose next access will be a store. We introduce and evaluate various instruction-based prediction schemes, broadly inspired by branch predictors. To guide the design, we run extensive trace-driven simulations on eight SPEC95 benchmarks with a wide range of cache configurations and observe that our simple penalty-sensitive policies reduce load misses relative to the classic algorithms across most of the benchmarks and cache configurations. In some cases the improvements are very large.
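To make the two mechanisms in the abstract concrete, here is a minimal Python sketch of one set of a penalty-sensitive LRU cache paired with a PC-indexed two-bit next-access-type predictor, loosely modeled on a bimodal branch predictor. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the class names, the 1024-entry table, the counter initialization, and the LRU-to-MRU victim scan are all choices made here for exposition.

```python
from collections import OrderedDict


class NextAccessPredictor:
    """PC-indexed table of 2-bit saturating counters (an assumed
    organization, borrowed from bimodal branch predictors). A counter
    value of 2 or 3 predicts that the next access to a block last
    touched by this instruction will be a store."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries  # start at "weakly predict load"

    def predict_store(self, pc):
        return self.table[pc % self.entries] >= 2

    def train(self, pc, was_store):
        i = pc % self.entries
        if was_store:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


class PenaltySensitiveLRUSet:
    """One set of a set-associative cache. Victim selection gives
    higher replacement priority to blocks whose next access is
    predicted to be a store (whose miss penalty is largely hidden),
    and falls back to plain LRU when no such block exists."""

    def __init__(self, ways, predictor):
        self.ways = ways
        self.predictor = predictor
        # tag -> PC of the last access. The OrderedDict keeps LRU order
        # (oldest first) because every access reinserts the tag at the end.
        self.blocks = OrderedDict()

    def access(self, tag, pc, is_store):
        hit = tag in self.blocks
        if hit:
            # On a re-access we learn the true "next access type" for
            # the instruction that last touched this block, so train.
            last_pc = self.blocks.pop(tag)
            self.predictor.train(last_pc, is_store)
        elif len(self.blocks) >= self.ways:
            self._evict()
        self.blocks[tag] = pc  # (re)insert as most recently used
        return hit

    def _evict(self):
        # Scan from LRU toward MRU and evict the first block predicted
        # to be accessed next by a store: an extra store miss is cheaper
        # than an extra load miss.
        victim = None
        for tag, last_pc in self.blocks.items():
            if self.predictor.predict_store(last_pc):
                victim = tag
                break
        if victim is not None:
            del self.blocks[victim]
        else:
            self.blocks.popitem(last=False)  # plain LRU fallback
```

The two-bit counters provide the same hysteresis they do in branch prediction: a single atypical access does not flip the prediction, which matters here because a misprediction converts a cheap store miss into an expensive load miss.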