Comparative performance evaluation of cache-coherent NUMA and COMA architectures
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor
Digital Technical Journal - Special 10th anniversary issue
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The case for a single-chip multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Reactive NUMA: a design for unifying S-COMA and CC-NUMA
Proceedings of the 24th annual international symposium on Computer architecture
Generating representative Web workloads for network and server performance evaluation
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
A fully associative software-managed cache design
Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Parallel Computer Architecture: A Hardware/Software Approach
Parallel Computer Architecture: A Hardware/Software Approach
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
A low-overhead coherence solution for multiprocessors with private cache memories
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Variability in Architectural Simulations of Multi-Threaded Workloads
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Exploiting the Cache Capacity of a Single-Chip Multi-Core Processor with Execution Migration
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
A NUCA substrate for flexible CMP cache sharing
Proceedings of the 19th annual international conference on Supercomputing
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory
Proceedings of the 33rd annual international symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
A flexible data to L2 cache mapping approach for future multicore processors
Proceedings of the 2006 workshop on Memory system performance and correctness
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
CMP cache performance projection: accessibility vs. capacity
ACM SIGARCH Computer Architecture News
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Virtual hierarchies to support server consolidation
Proceedings of the 34th annual international symposium on Computer architecture
Interconnect design considerations for large NUCA caches
Proceedings of the 34th annual international symposium on Computer architecture
Cooperative cache partitioning for chip multiprocessors
Proceedings of the 21st annual international conference on Supercomputing
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Analysis of static and dynamic energy consumption in NUCA caches: initial results
MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Adaptive set pinning: managing shared caches in chip multiprocessors
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Embedded processors and systems: Architectural issues and solutions for emerging applications
Journal of Embedded Computing - Embeded Processors and Systems: Architectural Issues and Solutions for Emerging Applications
Software-directed combined cpu/link voltage scaling fornoc-based cmps
SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Utilizing shared data in chip multiprocessors with the Nahalal architecture
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
SP-NUCA: a cost effective dynamic non-uniform cache architecture
ACM SIGARCH Computer Architecture News
Towards hybrid last level caches for chip-multiprocessors
ACM SIGARCH Computer Architecture News
A novel migration-based NUCA design for chip multiprocessors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Improving support for locality and fine-grain sharing in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Dynamic cache clustering for chip multiprocessors
Proceedings of the 23rd international conference on Supercomputing
Push-assisted migration of real-time tasks in multi-core processors
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Hybrid cache architecture with disparate memory technologies
Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
Triplet-based topology for on-chip networks
WSEAS Transactions on Computers
Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
L1 Collective Cache: Managing Shared Data for Chip Multiprocessors
APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Dynamic thread and data mapping for NoC based CMPs
Proceedings of the 46th Annual Design Automation Conference
Reusability-aware cache memory sharing for chip multiprocessors with private L2 caches
Journal of Systems Architecture: the EUROMICRO Journal
Optimizing shared cache behavior of chip multiprocessors
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Compiler-based data classification for hybrid caching
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Constraint-aware large-scale CMP cache design
HiPC'07 Proceedings of the 14th international conference on High performance computing
Performance and energy trade-offs analysis of L2 on-chip cache architectures for embedded MPSoCs
Proceedings of the 20th symposium on Great lakes symposium on VLSI
Proceedings of the 7th ACM international conference on Computing frontiers
The auction: optimizing banks usage in Non-Uniform Cache Architectures
Proceedings of the 24th ACM International Conference on Supercomputing
International Journal of High Performance Systems Architecture
Handling the problems and opportunities posed by multiple on-chip memory controllers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Power and performance of read-write aware hybrid caches with non-volatile memories
Proceedings of the Conference on Design, Automation and Test in Europe
Group-caching for NoC based multicore cache coherent systems
Proceedings of the Conference on Design, Automation and Test in Europe
Design exploration of hybrid caches with disparate memory technologies
ACM Transactions on Architecture and Code Optimization (TACO)
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Efficient address mapping of shared cache for on-chip many-core architecture
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Enhancing L2 organization for CMPs with a center cell
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Architectural support for thread communications in multi-core processors
Parallel Computing
Proposal and evaluation of APIs for utilizing inter-core time aggregation scheduler
JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors
Journal of Parallel and Distributed Computing
Studying inter-core data reuse in multicores
Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Studying inter-core data reuse in multicores
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Evaluating placement policies for managing capacity sharing in CMP architectures with private caches
ACM Transactions on Architecture and Code Optimization (TACO)
A cache-partitioning aware replacement policy for chip multiprocessors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
L2-Cache hierarchical organizations for multi-core architectures
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Reducing energy and increasing performance with traffic optimization in many-core systems
Proceedings of the System Level Interconnect Prediction Workshop
Performance/Thermal-Aware Design of 3D-Stacked L2 Caches for CMPs
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Extrinsic and intrinsic text cloning
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Neighborhood-aware data locality optimization for NoC-based multicores
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Energy-guided exploration of on-chip network design for exa-scale computing
Proceedings of the International Workshop on System Level Interconnect Prediction
Practically private: enabling high performance CMPs through compiler-assisted data classification
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A collaborative memory system for high-performance and cost-effective clustered architectures
Proceedings of the 1st Workshop on Architectures and Systems for Big Data
Survey of scheduling techniques for addressing shared resources in multicore processors
ACM Computing Surveys (CSUR)
Jigsaw: scalable software-defined caches
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Exploiting replication to improve performances of NUCA-based CMP systems
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighborsý caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a coreýs capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID. Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.