We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with an embedded switched network for high performance. We show that this organization can support a spectrum of sharing degrees. At one extreme the cache is unshared: each processor owns a private portion of the cache, which reduces hit latency. At the other extreme it is completely shared: every processor shares the entire cache, which minimizes misses. Every point in between is also supported. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
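To make the notion of a sharing degree concrete, the following is a minimal sketch of how 16 processors and 64 banks could be divided at each degree. Only the processor count, bank count, and total capacity come from the abstract. The function names, the contiguous grouping of processors, and the even split of banks are assumptions made for illustration; they are not the bank mapping policies evaluated in the paper.

```python
# Figures taken from the abstract.
NUM_PROCESSORS = 16
NUM_BANKS = 64
CACHE_MBYTES = 16


def partition_for(processor, sharing_degree):
    """Return the bank indices visible to `processor` (hypothetical mapping).

    Assumption: processors are grouped contiguously and each group of
    `sharing_degree` processors receives an equal, contiguous slice of banks.
    """
    if not 0 <= processor < NUM_PROCESSORS:
        raise ValueError("processor index out of range")
    if sharing_degree <= 0 or NUM_PROCESSORS % sharing_degree != 0:
        raise ValueError("sharing degree must evenly divide the processor count")

    num_partitions = NUM_PROCESSORS // sharing_degree
    banks_per_partition = NUM_BANKS // num_partitions
    partition = processor // sharing_degree
    start = partition * banks_per_partition
    return range(start, start + banks_per_partition)


def partition_mbytes(sharing_degree):
    """Capacity of one partition in Mbytes, assuming an even split."""
    return CACHE_MBYTES * sharing_degree / NUM_PROCESSORS


if __name__ == "__main__":
    for degree in (1, 2, 4, 8, 16):
        banks = partition_for(0, degree)
        print(degree, len(banks), partition_mbytes(degree))
```

Under these assumptions, degree 1 corresponds to the unshared case (4 banks and 1 Mbyte per processor) and degree 16 to the completely shared case (all 64 banks and 16 Mbytes visible to every processor).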