L1 Cache Filtering Through Random Selection of Memory References

Authors:
Yoav Etsion;Dror G. Feitelson
Affiliations:
The Hebrew University of Jerusalem, Israel;The Hebrew University of Jerusalem, Israel
Venue:
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Year:
2007

Citing 0
Cited 7

Less reused filter: improving L2 cache performance via filtering less reused lines
Proceedings of the 23rd international conference on Supercomputing

Improving performance of digest caches in network processors
HiPC'08 Proceedings of the 15th international conference on High performance computing

Branch target buffer design for embedded processors
Microprocessors & Microsystems

Dynamic and adaptive SPM management for a multi-task environment
Journal of Systems Architecture: the EUROMICRO Journal

Dynamic access distance driven cache replacement
ACM Transactions on Architecture and Code Optimization (TACO)

FELI: HW/SW support for on-chip distributed shared memory in multicores
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I

A register-file approach for row buffer caches in die-stacked DRAMs
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract

Distinguishing transient blocks from frequently used blocks enables servicing references to transient blocks from a small fully-associative auxiliary cache structure. By inserting only frequently used blocks into the main cache structure, we can reduce the number of conflict misses, thus achieving higher performance and allowing the use of direct-mapped caches, which offer lower power consumption and lower access latencies. We suggest using a simple probabilistic filtering mechanism based on random sampling to identify and select the frequently used blocks. Furthermore, by using a small direct-mapped lookup table to cache the most recently accessed blocks in the auxiliary cache, we eliminate the vast majority of the costly fully-associative lookups. Finally, we show that a 16K direct-mapped L1 cache, augmented with a fully-associative 2K filter, achieves on average over 10% more instructions per cycle (IPC) than a regular 16K, 4-way set-associative cache, and even ~5% more IPC than a 32K, 4-way cache, while consuming 70%-80% less dynamic power than either of them.
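
The random-sampling idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' design: the class name `FilteredCache`, the LRU policy in the filter, the rule that a sampled reference moves its block into the direct-mapped cache, and the parameter `sample_prob` are all assumptions made for illustration, and the small lookup table in front of the filter is omitted. The intuition it tries to capture is that if each reference is selected independently with probability p, a block referenced n times is selected at least once with probability 1 - (1 - p)^n, so frequently used blocks are far more likely to reach the main cache than transient ones.

```python
import random
from collections import OrderedDict


class FilteredCache:
    """Toy model (assumed, not from the paper): a direct-mapped main cache
    fronted by a small fully-associative filter with LRU replacement."""

    def __init__(self, main_sets, filter_blocks, sample_prob, seed=0):
        self.main = [None] * main_sets      # one resident block per set
        self.filter = OrderedDict()         # fully-associative, oldest first
        self.filter_blocks = filter_blocks  # filter capacity in blocks
        self.sample_prob = sample_prob      # hypothetical selection probability
        self.rng = random.Random(seed)      # seeded for reproducible runs
        self.hits = 0
        self.misses = 0

    def access(self, block):
        """Reference one block number; return where it was serviced."""
        idx = block % len(self.main)
        if self.main[idx] == block:
            self.hits += 1
            return "main"

        in_filter = block in self.filter
        if in_filter:
            self.hits += 1
            self.filter.move_to_end(block)  # refresh LRU position
        else:
            self.misses += 1

        # Random sampling: every reference not served by the main cache is
        # selected with probability sample_prob and promoted to the main cache.
        if self.rng.random() < self.sample_prob:
            self.filter.pop(block, None)
            self.main[idx] = block          # displaced block is simply dropped
            return "filter->main" if in_filter else "miss->main"

        if not in_filter:
            self.filter[block] = True
            if len(self.filter) > self.filter_blocks:
                self.filter.popitem(last=False)  # evict least recently used
        return "filter" if in_filter else "miss"


if __name__ == "__main__":
    cache = FilteredCache(main_sets=256, filter_blocks=32, sample_prob=0.05)
    hot = list(range(8))                       # small, heavily reused set
    for i in range(10_000):
        cache.access(hot[i % len(hot)])        # frequently used blocks
        cache.access(1_000 + i)                # transient, single-use blocks
    resident = sum(1 for b in cache.main if b in hot)
    print(f"hot blocks resident in main cache: {resident}/{len(hot)}")
    print(f"hits={cache.hits} misses={cache.misses}")
```

In this model, setting `sample_prob` to 0 keeps every block in the filter, while setting it to 1 degenerates into an ordinary direct-mapped cache; intermediate values trade promotion delay for hot blocks against pollution by transient ones. The value 0.05 used above is arbitrary and not taken from the paper.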