A model for estimating trace-sample miss ratios
SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
A modified approach to data cache management
Proceedings of the 28th annual international symposium on Microarchitecture
The SimpleScalar tool set, version 2.0
ACM SIGARCH Computer Architecture News
Utilizing reuse information in data cache management
ICS '98 Proceedings of the 12th international conference on Supercomputing
IEEE Transactions on Computers
ACM Computing Surveys (CSUR)
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Adaptive insertion policies for high performance caching
Proceedings of the 34th annual international symposium on Computer architecture
ACM SIGARCH Computer Architecture News
Scavenger: A New Last Level Cache Architecture with Global Block Priority
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Counter-Based Cache Replacement and Bypassing Algorithms
IEEE Transactions on Computers
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Divide-and-conquer: a bubble replacement for low level caches
Proceedings of the 23rd international conference on Supercomputing
A study of replacement algorithms for a virtual-storage computer
IBM Systems Journal
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors
Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors
Instruction-based reuse-distance prediction for effective cache management
SAMOS'09 Proceedings of the 9th international conference on Systems, architectures, modeling and simulation
High performance cache replacement using re-reference interval prediction (RRIP)
Proceedings of the 37th annual international symposium on Computer architecture
Exploiting reuse locality on inclusive shared last-level caches
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Exploiting reuse information to reduce refresh energy in on-chip eDRAM caches
Proceedings of the 27th international ACM conference on International conference on supercomputing
Low-energy volatile STT-RAM cache design using cache-coherence-enabled adaptive refresh
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Hi-index | 0.00 |
Memory latency has become an important performance bottleneck in current microprocessors. This problem aggravates as the number of cores sharing the same memory controller increases. To palliate this problem, a common solution is to implement cache hierarchies with large or huge Last-Level Cache (LLC) organizations. LLC memories are implemented with a high number of ways (e.g., 16) to reduce conflict misses. Typically, caches have implemented the LRU algorithm to exploit temporal locality, but its performance goes away from the optimal as the number of ways increases. In addition, the implementation of a strict LRU algorithm is costly in terms of area and power. This article focuses on a family of low-cost replacement strategies, whose implementation scales with the number of ways while maintaining the performance. The proposed strategies track the accessing order for just a few blocks, which cannot be replaced. The victim is randomly selected among those blocks exhibiting poor locality. Although, in general, the random policy helps improving the performance, in some applications the scheme fails with respect to the LRU policy leading to performance degradation. This drawback can be overcome by the addition of a small victim cache of the large LLC. Experimental results show that, using the best version of the family without victim cache, MPKI reduction falls in between 10% and 11% compared to a set of the most representative state-of-the-art algorithms, whereas the reduction grows up to 22% with respect to LRU. The proposal with victim cache achieves speedup improvements, on average, by 4% compared to LRU. In addition, it reduces dynamic energy, on average, up to 8%. Finally, compared to the studied algorithms, hardware complexity is largely reduced by the baseline algorithm of the family.