High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Column-associative caches: a technique for reducing the miss rate of direct-mapped caches
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Cache design trade-offs for power and performance optimization: a case study
ISLPED '95 Proceedings of the 1995 international symposium on Low power design
On high-bandwidth data cache design for multi-issue processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Efficient Hardware Hashing Functions for High Performance Computers
IEEE Transactions on Computers
Optimizing the Instruction Cache Performance of the Operating System
IEEE Transactions on Computers
Selective cache ways: on-demand cache resource allocation
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Dead-block prediction & dead-block correlating prefetchers
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
L1 data cache decomposition for energy efficiency
ISLPED '01 Proceedings of the 2001 international symposium on Low power electronics and design
Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Compiling for instruction cache performance on a multithreaded architecture
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Partitioned first-level cache design for clustered microarchitectures
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Predictive sequential associative cache
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Balanced Cache: Reducing Conflict Misses of Direct-Mapped Caches
Proceedings of the 33rd annual international symposium on Computer Architecture
POWER5 System microarchitecture
IBM Journal of Research and Development - POWER5 and packaging
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive insertion policies for high performance caching
Proceedings of the 34th annual international symposium on Computer architecture
IBM Journal of Research and Development
Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
IBM System Z10 Enterprise Class Technical Guide
IBM System Z10 Enterprise Class Technical Guide
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches
Proceedings of the 36th annual international symposium on Computer architecture
POWER4 system microarchitecture
IBM Journal of Research and Development
POWER7 multi-core processor design
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive line placement with the set balancing cache
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A cache-partitioning aware replacement policy for chip multiprocessors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
Adaptive Set-Granular Cooperative Caching
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
ASCIB: adaptive selection of cache indexing bits for removing conflict misses
Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design
Hi-index | 0.00 |
First-level caches are usually split for both instructions and data instead of unifying them in a single cache. Although that approach eases the pipeline design and provides a simple way to independently treat data and instructions, its global hit rate is usually smaller than that of a unified cache. Furthermore, unified lower-level caches usually behave and process memory requests disregarding whether they are data or instruction requests. In this article, we propose a new technique aimed to balance the amount of space devoted to instructions and data for optimizing set-associative caches: the Virtually Split Cache or VSC. Our technique combines the sharing of resources from unified approaches with the bandwidth and parallelism that split configurations provide, thus reducing power consumption while not degrading performance. Our design dynamically adjusts cache resources devoted to instructions and data depending on their particular demand. Two VSC designs are proposed in order to track the instructions and data requirements. The Shadow Tag VSC (ST-VSC) is based on shadow tags that store the last evicted line related to data and instructions in order to determine how well the cache would work with one more way per set devoted to each kind. The Global Selector VSC (GS-VSC) uses a saturation counter that is updated every time a cache miss occurs either under an instruction or data request applying a duel-like mechanism. Experiments with a variable and a fixed latency VSC show that ST-VSC and GS-VSC reduce on average the cache hierarchy power consumption by 29% and 24%, respectively, with respect to a standard baseline. As for performance, while the fixed latency designs virtually match the split baseline in a single-core system, a variable latency ST-VSC and GS-VSC increase the average IPC by 2.5% and 2%, respectively. In multicore systems, even the slower fixed latency ST-VSC and GS-VSC designs improve the baseline IPC by 3.1% and 2.5%, respectively, in a four-core system thanks to the reduction in the bandwidth demanded from the lower cache levels. This is in contrast with many techniques that trade performance degradation for power consumption reduction. VSC particularly benefits embedded processors with a single level of cache, where up to an average 9.2% IPC improvement is achieved. Interestingly, we also find that partitioning the LLC for instructions and data can improve performance around 2%.