The high clock frequencies of modern superscalar processors make the wire delay incurred in moving data across the processor chip a significant concern. As frequencies continue to increase, it will become more difficult for a centralized first-level data cache to supply data bandwidth in the timely manner that superscalar processors require.

This paper presents a complete solution for partitioning the first level of the memory hierarchy. The first-level data cache is split into several independent partitions, which can be distributed arbitrarily across the processor die. After being decoded, memory instructions are sent to the reservation stations of the functional unit adjacent to the cache partition that they are most likely to access. The partition assignments for both static instructions and cache data are changed dynamically to adapt to data access patterns. A data cache line is permitted to reside in only one partition at a time, so each store updates only a single partition, and the memory disambiguation logic can itself be partitioned and simplified.

The partitioned cache reduces cache access latency through a combination of reduced wire delay and reduced cache array size. A partitioned cache with eight 8KB direct-mapped partitions maintains a hit rate greater than that of a 32KB direct-mapped cache. When cache latencies estimated with the CACTI cache simulation tool are taken into account, a machine using the partitioned cache outperforms a machine with a conventional 64KB direct-mapped cache by 4.5% and a machine with a 64KB 8-way set-associative cache by 7.0%.
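The steering and single-residency rules described in the abstract can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the paper's mechanism: the abstract does not say how the likely partition is predicted or how lines are reassigned, so the last-partition predictor indexed by instruction address, the 64-byte line size, the placement of a missing line in the steered partition, and all names (`PartitionedCache`, `steer`, `access`) are hypothetical. Only the eight 8KB direct-mapped partitions and the one-partition-per-line rule come from the text.

```python
from dataclasses import dataclass, field

LINE_BYTES = 64                                # assumed line size (not given in the abstract)
NUM_PARTITIONS = 8                             # eight partitions, as in the abstract
LINES_PER_PARTITION = 8 * 1024 // LINE_BYTES   # 8KB direct-mapped partition


def _empty_tags():
    # One direct-mapped tag array per partition: index -> line address or None.
    return [[None] * LINES_PER_PARTITION for _ in range(NUM_PARTITIONS)]


@dataclass
class PartitionedCache:
    tags: list = field(default_factory=_empty_tags)
    # line address -> the single partition that currently holds it
    home: dict = field(default_factory=dict)
    # instruction address -> partition it last accessed (hypothetical predictor)
    predictor: dict = field(default_factory=dict)

    def steer(self, pc):
        """Choose the partition whose reservation stations get this instruction."""
        # Assumption: fall back to a simple hash of the PC when untrained.
        return self.predictor.get(pc, pc % NUM_PARTITIONS)

    def access(self, pc, addr):
        """Model one access; return (steered_partition, actual_partition, hit)."""
        line = addr // LINE_BYTES
        index = line % LINES_PER_PARTITION
        steered = self.steer(pc)
        if line in self.home:
            # The line already lives in exactly one partition.
            actual, hit = self.home[line], True
        else:
            # Assumption: a missing line is filled into the steered partition.
            actual, hit = steered, False
            victim = self.tags[actual][index]
            if victim is not None:
                del self.home[victim]          # evict the conflicting line
            self.tags[actual][index] = line
            self.home[line] = actual
        # Retrain so the instruction is steered to the right partition next time.
        self.predictor[pc] = actual
        return steered, actual, hit
```

In this model, an instruction that is first steered to the wrong partition still finds its line in the one partition that holds it, and the predictor entry for that instruction is updated so the next dynamic instance is sent to the correct partition. A real design would also have to charge a latency penalty for the mis-steered access, which this sketch does not model.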