Critical words cache memory: exploiting criticality within primary cache miss streams

Authors: Harold Carter; Edmund J. Gieske
Affiliations: University of Cincinnati; University of Cincinnati
Venue: Critical words cache memory: exploiting criticality within primary cache miss streams
Year: 2008

Citing 0
Cited 1

Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract

The major constraints on increasing computer performance are power dissipation and memory latency. These have led to increases in secondary cache memory (L2$) capacity to minimize the occurrence of power-intensive, slow off-chip main memory accesses. However, as they have grown, secondary cache memories have become a large part of the total processor power dissipation, and their access time has increased in terms of processor clock cycles. Most cache memory architecture research has focused on primary cache memory (L1$) or the overall cache hierarchy. In contrast, architectural improvements of the L2$ have usually been simple increases in capacity and associativity. Our research concerns two previously unexamined attributes of L1$ misses and a novel architectural means to reduce the average hit time and power dissipation of L2$ designs without negatively impacting their hit rates. We investigate both a form of sequence regularity in L1$ miss streams and the quantity of critical words within cache blocks as indicators of the potential for memory hierarchy speed and power improvements resulting from segregating the L2$ treatment of so-called critical and non-critical words. We call this form of sequence regularity “critical word regularity” (CWR) and the number of critical words within a cache block its “critical footprint size” (CFS); cache memories whose architectures exploit CWR and CFS we call “critical words cache” (CW$) memories. We describe practical CW$ architectures, operating methods, and implementation approaches. We show that CW$ memories offer dramatically higher performance than standard cache architectures employing the well-known critical-word-first bus protocols.

Our investigation consisted of four major phases, each of which employed a trace-driven cache simulation experiment. The goal of the first phase was to determine whether significant CWR exists in the load miss stream of a primary data cache memory (L1D$). Having found this to be the case, we made initial estimates of potential CW$ performance. The second phase sought to quantify the CWR and CFS in the load miss streams of the SPEC CPU 2000 collection of benchmark applications across nine L1D$ configurations. The CWR results of the second experiment were then used to estimate both secondary CW$ coverage of L1D$ load misses and the overall performance of a computer system with a memory hierarchy that includes a CW$. The third phase of our investigation built on the second and measured CWR and CFS more completely. The range of benchmarks was expanded in the third phase experiment, and the CWR of instruction fetch misses and data store misses was measured in addition to that of data load misses. The CFS distributions were also measured to better estimate the resource requirements for practical CW$ memories. The fourth and final phase of our investigation determined the workload performance improvements obtainable with practical CW$ memories of various capacities, configurations, operating methods, and implementations. We also further explored the cost and performance tradeoffs made possible by exploitation of CWR and CFS using a CW$ secondary cache architecture.

Our investigation shows that sufficient CWR exists in both data and instruction miss streams for the segregation of the critical words in L2$ blocks to be worthwhile. The average CWR for all miss types in both SPEC CPU 2000 and 2006 workloads was found to range from almost 40% to 90% across a wide range of L1$ configurations. CWR was found to depend primarily on the workload and secondarily on the cache configuration. We also found that, on average, more than half of all cache blocks that are repeatedly missed in an L1$ have only one critical word, even in L1$ designs composed of large 128-byte blocks. In all but one of the L1$ configurations we examined, only one quarter of the words were ever critical words in more than 77% of the repeatedly missed cache blocks in the data load miss streams. We used our CWR and CFS results to estimate that exploitation of criticality in L1$ miss streams by using a secondary CW$ has the potential to cover more than 60% of L1D$ load misses more quickly and efficiently than standard architecture cache memories. Several practical CW$ configurations were found that achieve average L2$ hit coverage in excess of 70%. CW$ hit coverage was also found to scale well, generally increasing with overall cache capacity. (Abstract shortened by UMI.)
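
The two quantities named in the abstract can be illustrated with a small trace-processing sketch. The abstract does not give formal definitions, so everything here is an assumption made for illustration: the critical word is taken to be the word whose access triggered the L1$ miss, CFS is taken to be the number of distinct critical words seen for a block that is missed more than once, and CWR is approximated as the fraction of repeat misses whose critical word matches that of the previous miss to the same block. The function names, the 64-byte block size, and the 8-byte word size are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def split_address(addr, block_bytes=64, word_bytes=8):
    """Return (block number, word index within the block) for a byte address."""
    return addr // block_bytes, (addr % block_bytes) // word_bytes


def criticality_stats(miss_addrs, block_bytes=64, word_bytes=8):
    """Compute an assumed CWR estimate and per-block CFS from a miss trace.

    miss_addrs: iterable of byte addresses, one per L1 miss, in program order.
    Returns (cwr, cfs), where cfs maps block number -> number of distinct
    critical words, restricted to blocks that were missed more than once.
    """
    last_word = {}                 # block -> critical word of its latest miss
    footprint = defaultdict(set)   # block -> all critical words seen so far
    miss_count = defaultdict(int)  # block -> number of misses
    repeat_misses = 0
    same_word_repeats = 0

    for addr in miss_addrs:
        block, word = split_address(addr, block_bytes, word_bytes)
        if block in last_word:
            repeat_misses += 1
            if last_word[block] == word:
                same_word_repeats += 1
        last_word[block] = word
        footprint[block].add(word)
        miss_count[block] += 1

    # Assumed CWR proxy: share of repeat misses that reuse the previous critical word.
    cwr = same_word_repeats / repeat_misses if repeat_misses else 0.0
    # Assumed CFS: distinct critical words, for repeatedly missed blocks only.
    cfs = {b: len(ws) for b, ws in footprint.items() if miss_count[b] > 1}
    return cwr, cfs
```

Calling `criticality_stats` on a list of miss addresses from a trace-driven simulator returns one ratio and one dictionary; a histogram of the dictionary's values would correspond, under these assumptions, to a CFS distribution of the kind the third phase reports.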