The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Design and Performance of Directory Caches for Scalable Shared Memory Multiprocessors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Switch Cache: A Framework for Improving the Remote Memory Access Latency of CC-NUMA Multiprocessors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
IEEE Transactions on Parallel and Distributed Systems
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
An efficient cache design for scalable glueless shared-memory multiprocessors
Proceedings of the 3rd conference on Computing frontiers
POWER5 System microarchitecture
IBM Journal of Research and Development - POWER5 and packaging
A novel lightweight directory architecture for scalable shared-memory multiprocessors
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Hi-index | 0.00 |
The large working sets of commercial and scientific workloads favor a shared L2 cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests in Chip Multiprocessors (CMP). The exponential increase in the number of cores results in the commensurate increase in the memory cost of directory, restricting its scalability severely. To resolve this hurdle, a novel Lightweight Shared Cache design is proposed in this paper, which applies two small fast caches to store and manage the data and directory vectors for the blocks recently cached by L1 caches in each tile of CMP. The proposed cache scheme removes the directory vectors from L2 cache, thus decreases on-chip directory memory overhead and improves the scalability. Moreover, the proposed cache scheme brings significant reductions in terms of the L1 cache miss latencies, which lead to the improvement of program performance by 6% on average, and up to 16% at best, with 0.18% storage overhead.