Inexpensive implementations of set-associativity
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
High-bandwidth data memory systems for superscalar processors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Evaluation of design alternatives for a multiprocessor microprocessor
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
IEEE Micro
Effective Hardware-Based Data Prefetching for High-Performance Processors
IEEE Transactions on Computers
The impact of shared-cache clustering in small-scale shared-memory multiprocessors
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Dynamic Partitioning of Shared Cache Memory
The Journal of Supercomputing
CQoS: a framework for enabling QoS in shared caches of CMP platforms
Proceedings of the 18th annual international conference on Supercomputing
Coherence decoupling: making use of incoherence
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
POWER4 system microarchitecture
IBM Journal of Research and Development
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory
Proceedings of the 33rd annual international symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Interconnect-Aware Coherence Protocols for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Application of full-system simulation in exploratory system design and development
IBM Journal of Research and Development
Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A flexible data to L2 cache mapping approach for future multicore processors
Proceedings of the 2006 workshop on Memory system performance and correctness
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
CMP cache performance projection: accessibility vs. capacity
ACM SIGARCH Computer Architecture News
Performance/area efficiency in chip multiprocessors with micro-caches
Proceedings of the 4th international conference on Computing frontiers
Proximity-aware directory-based coherence for multi-core processor architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Virtual hierarchies to support server consolidation
Proceedings of the 34th annual international symposium on Computer architecture
Interconnect design considerations for large NUCA caches
Proceedings of the 34th annual international symposium on Computer architecture
The Power of Priority: NoC Based Distributed Cache Coherency
NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
Cooperative cache partitioning for chip multiprocessors
Proceedings of the 21st annual international conference on Supercomputing
Visions for application development on hybrid computing systems
Parallel Computing
Utilizing shared data in chip multiprocessors with the Nahalal architecture
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Statistical Approach to NoC Design
NOCS '08 Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip
A novel migration-based NUCA design for chip multiprocessors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Distributed cooperative caching
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Analysis and approximation of optimal co-scheduling on chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
A study on optimally co-scheduling jobs of different lengths on chip multiprocessors
Proceedings of the 6th ACM conference on Computing frontiers
Dynamic cache clustering for chip multiprocessors
Proceedings of the 23rd international conference on Supercomputing
Hybrid cache architecture with disparate memory technologies
Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
NUDA: a non-uniform debugging architecture and non-intrusive race detection for many-core
Proceedings of the 46th Annual Design Automation Conference
No cache-coherence: a single-cycle ring interconnection for multi-core L1-NUCA sharing on 3D chips
Proceedings of the 46th Annual Design Automation Conference
ACM Transactions on Architecture and Code Optimization (TACO)
An analysis of on-chip interconnection networks for large-scale chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
A scalable organization for distributed directories
Journal of Systems Architecture: the EUROMICRO Journal
Contention aware execution: online contention detection and response
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
LRU-PEA: a smart replacement policy for non-uniform cache architectures on chip multiprocessors
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Adaptive L2 cache for chip multiprocessors
Euro-Par'07 Proceedings of the 2007 conference on Parallel processing
Software data spreading: leveraging distributed caches to improve single thread performance
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
The auction: optimizing banks usage in Non-Uniform Cache Architectures
Proceedings of the 24th ACM International Conference on Supercomputing
Area-efficient floorplans and interconnects for homogeneous multi-core architectures
International Journal of High Performance Systems Architecture
Proceedings of the 37th annual international symposium on Computer architecture
International Journal of High Performance Systems Architecture
NoC-aware cache design for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A power-efficient migration mechanism for D-NUCA caches
Proceedings of the Conference on Design, Automation and Test in Europe
Design exploration of hybrid caches with disparate memory technologies
ACM Transactions on Architecture and Code Optimization (TACO)
Comparing last-level cache designs for CMP architectures
Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Power-efficient spilling techniques for chip multiprocessors
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Thread owned block cache: managing latency in many-core architecture
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Enhancing L2 organization for CMPs with a center cell
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors
Journal of Parallel and Distributed Computing
Loaf: a framework and infrastructure for creating online adaptive solutions
Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era
A study of 3D Network-on-Chip design for data parallel H.264 coding
Microprocessors & Microsystems
DAPSCO: Distance-aware partially shared cache organization
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Region scheduling: efficiently using the cache architectures via page-level affinity
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Improving coherence protocol reactiveness by trading bandwidth for latency
Proceedings of the 9th conference on Computing Frontiers
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
ACM Transactions on Computer Systems (TOCS)
NOC-Out: Microarchitecting a Scale-Out Processor
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Addressing End-to-End Memory Access Latency in NoC-Based Multicores
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Replacement techniques for dynamic NUCA cache designs on CMPs
The Journal of Supercomputing
Towards efficient dynamic LLC home bank mapping with noc-level support
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Exploiting replication to improve performances of NUCA-based CMP systems
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency, completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four work best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.