Optimizing Replication, Communication, and Capacity Allocation in CMPs

Authors:
Zeshan Chishti;Michael D. Powell;T. N. Vijaykumar
Affiliations:
Purdue University;Purdue University;Purdue University
Venue:
Proceedings of the 32nd annual international symposium on Computer Architecture
Year:
2005

Citing 20
Cited 74

Comparative performance evaluation of cache-coherent NUMA and COMA architectures

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor

Digital Technical Journal - Special 10th anniversary issue
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The case for a single-chip multiprocessor

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Reactive NUMA: a design for unifying S-COMA and CC-NUMA

Proceedings of the 24th annual international symposium on Computer architecture
Generating representative Web workloads for network and server performance evaluation

SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
A fully associative software-managed cache design

Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Trends in Shared Memory Multiprocessing

Computer
Simics: A Full System Simulation Platform

Computer
A low-overhead coherence solution for multiprocessors with private cache memories

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Variability in Architectural Simulations of Multi-Threaded Workloads

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
TLC: Transmission Line Caches

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Exploiting the Cache Capacity of a Single-Chip Multi-Core Processor with Execution Migration

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture

A NUCA substrate for flexible CMP cache sharing

Proceedings of the 19th annual international conference on Supercomputing
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory

Proceedings of the 33rd annual international symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
A flexible data to L2 cache mapping approach for future multicore processors

Proceedings of the 2006 workshop on Memory system performance and correctness
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Molecular Caches: A caching structure for dynamic creation of application-specific Heterogeneous cache regions

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
CMP cache performance projection: accessibility vs. capacity

ACM SIGARCH Computer Architecture News
Scheduling threads for constructive cache sharing on CMPs

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Virtual hierarchies to support server consolidation

Proceedings of the 34th annual international symposium on Computer architecture
Interconnect design considerations for large NUCA caches

Proceedings of the 34th annual international symposium on Computer architecture
Cooperative cache partitioning for chip multiprocessors

Proceedings of the 21st annual international conference on Supercomputing
A reusability-aware cache memory sharing technique for high-performance low-power CMPs with private L2 caches

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Exploring Large-Scale CMP Architectures Using ManySim

IEEE Micro
Analysis of static and dynamic energy consumption in NUCA caches: initial results

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Adaptive set pinning: managing shared caches in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Embedded processors and systems: Architectural issues and solutions for emerging applications

Journal of Embedded Computing - Embeded Processors and Systems: Architectural Issues and Solutions for Emerging Applications
Software-directed combined cpu/link voltage scaling fornoc-based cmps

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Utilizing shared data in chip multiprocessors with the Nahalal architecture

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
SP-NUCA: a cost effective dynamic non-uniform cache architecture

ACM SIGARCH Computer Architecture News
Towards hybrid last level caches for chip-multiprocessors

ACM SIGARCH Computer Architecture News
A novel migration-based NUCA design for chip multiprocessors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Improving support for locality and fine-grain sharing in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
A leakage-aware cache sharing technique for low-power chip multi-processors (CMPs) with private L2 caches

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Dynamic cache clustering for chip multiprocessors

Proceedings of the 23rd international conference on Supercomputing
Push-assisted migration of real-time tasks in multi-core processors

Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Hybrid cache architecture with disparate memory technologies

Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
Triplet-based topology for on-chip networks

WSEAS Transactions on Computers
A centralized supply voltage and local body bias-based compensation approach to mitigate within-die process variation

Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
L1 Collective Cache: Managing Shared Data for Chip Multiprocessors

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Dynamic thread and data mapping for NoC based CMPs

Proceedings of the 46th Annual Design Automation Conference
Reusability-aware cache memory sharing for chip multiprocessors with private L2 caches

Journal of Systems Architecture: the EUROMICRO Journal
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Compiler-based data classification for hybrid caching

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Constraint-aware large-scale CMP cache design

HiPC'07 Proceedings of the 14th international conference on High performance computing
Performance and energy trade-offs analysis of L2 on-chip cache architectures for embedded MPSoCs

Proceedings of the 20th symposium on Great lakes symposium on VLSI
NCID: a non-inclusive cache, inclusive directory architecture for flexible and efficient cache hierarchies

Proceedings of the 7th ACM international conference on Computing frontiers
The auction: optimizing banks usage in Non-Uniform Cache Architectures

Proceedings of the 24th ACM International Conference on Supercomputing
Way adaptable D-NUCA caches

International Journal of High Performance Systems Architecture
Handling the problems and opportunities posed by multiple on-chip memory controllers

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Power and performance of read-write aware hybrid caches with non-volatile memories

Proceedings of the Conference on Design, Automation and Test in Europe
Group-caching for NoC based multicore cache coherent systems

Proceedings of the Conference on Design, Automation and Test in Europe
Design exploration of hybrid caches with disparate memory technologies

ACM Transactions on Architecture and Code Optimization (TACO)
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Efficient address mapping of shared cache for on-chip many-core architecture

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Enhancing L2 organization for CMPs with a center cell

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Architectural support for thread communications in multi-core processors

Parallel Computing
Proposal and evaluation of APIs for utilizing inter-core time aggregation scheduler

JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

Journal of Parallel and Distributed Computing
Studying inter-core data reuse in multicores

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Studying inter-core data reuse in multicores

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Evaluating placement policies for managing capacity sharing in CMP architectures with private caches

ACM Transactions on Architecture and Code Optimization (TACO)
A cache-partitioning aware replacement policy for chip multiprocessors

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
L2-Cache hierarchical organizations for multi-core architectures

ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Reducing energy and increasing performance with traffic optimization in many-core systems

Proceedings of the System Level Interconnect Prediction Workshop
Performance/Thermal-Aware Design of 3D-Stacked L2 Caches for CMPs

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Extrinsic and intrinsic text cloning

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Neighborhood-aware data locality optimization for NoC-based multicores

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Energy-guided exploration of on-chip network design for exa-scale computing

Proceedings of the International Workshop on System Level Interconnect Prediction
Practically private: enabling high performance CMPs through compiler-assisted data classification

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A collaborative memory system for high-performance and cost-effective clustered architectures

Proceedings of the 1st Workshop on Architectures and Systems for Big Data
Survey of scheduling techniques for addressing shared resources in multicore processors

ACM Computing Surveys (CSUR)
Jigsaw: scalable software-defined caches

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Exploiting replication to improve performances of NUCA-based CMP systems

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighborsý caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a coreýs capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID. Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.