The case for a single-chip multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Increasing processor performance by implementing deeper pipelines
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Cache-Only Memory Architectures
Computer
Exploring the Design Space of Future CMPs
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Utilization of Cache Area in On-Chip Multiprocessor
ISHPC '99 Proceedings of the Second International Symposium on High Performance Computing
Optimizing pipelines for power and performance
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling
Proceedings of the 30th annual international symposium on Computer architecture
Optimum Power/Performance Pipeline Depth
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Proceedings of the 31st annual international symposium on Computer architecture
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Accelerating Multiprocessor Simulation with a Memory Timestamp Record
ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
POWER4 system microarchitecture
IBM Journal of Research and Development
Design and Management of 3D Chip Multiprocessors Using Network-in-Memory
Proceedings of the 33rd annual international symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
A flexible data to L2 cache mapping approach for future multicore processors
Proceedings of the 2006 workshop on Memory system performance and correctness
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
CMP cache performance projection: accessibility vs. capacity
ACM SIGARCH Computer Architecture News
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Proximity-aware directory-based coherence for multi-core processor architectures
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Virtual hierarchies to support server consolidation
Proceedings of the 34th annual international symposium on Computer architecture
Interconnect design considerations for large NUCA caches
Proceedings of the 34th annual international symposium on Computer architecture
The Power of Priority: NoC Based Distributed Cache Coherency
NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
Cooperative cache partitioning for chip multiprocessors
Proceedings of the 21st annual international conference on Supercomputing
ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Adaptive set pinning: managing shared caches in chip multiprocessors
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Utilizing shared data in chip multiprocessors with the Nahalal architecture
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
SP-NUCA: a cost effective dynamic non-uniform cache architecture
ACM SIGARCH Computer Architecture News
Towards hybrid last level caches for chip-multiprocessors
ACM SIGARCH Computer Architecture News
The Journal of Supercomputing
A novel migration-based NUCA design for chip multiprocessors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Distributed cooperative caching
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Improving support for locality and fine-grain sharing in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Dynamic cache clustering for chip multiprocessors
Proceedings of the 23rd international conference on Supercomputing
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
Scalable directory architecture for distributed shared memory chip multiprocessors
ACM SIGARCH Computer Architecture News
An Efficient Lightweight Shared Cache Design for Chip Multiprocessors
APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
A Novel Cache Organization for Tiled Chip Multiprocessor
APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Reusability-aware cache memory sharing for chip multiprocessors with private L2 caches
Journal of Systems Architecture: the EUROMICRO Journal
ACM Transactions on Architecture and Code Optimization (TACO)
An analysis of on-chip interconnection networks for large-scale chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Compiler-based data classification for hybrid caching
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
A scalable organization for distributed directories
Journal of Systems Architecture: the EUROMICRO Journal
Software—Practice & Experience
HiPC'07 Proceedings of the 14th international conference on High performance computing
Proceedings of the 7th ACM international conference on Computing frontiers
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Cache topology aware computation mapping for multicores
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Software data spreading: leveraging distributed caches to improve single thread performance
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
The auction: optimizing banks usage in Non-Uniform Cache Architectures
Proceedings of the 24th ACM International Conference on Supercomputing
Proceedings of the 37th annual international symposium on Computer architecture
Data marshaling for multi-core architectures
Proceedings of the 37th annual international symposium on Computer architecture
Journal of Systems Architecture: the EUROMICRO Journal
Handling the problems and opportunities posed by multiple on-chip memory controllers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
ATAC: a 1000-core cache-coherent processor with on-chip optical network
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Group-caching for NoC based multicore cache coherent systems
Proceedings of the Conference on Design, Automation and Test in Europe
Power-efficient spilling techniques for chip multiprocessors
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Enhancing L2 organization for CMPs with a center cell
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Characterizing the impact of process variation on 45 nm NoC-based CMPs
Journal of Parallel and Distributed Computing
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors
Journal of Parallel and Distributed Computing
A case for globally shared-medium on-chip interconnect
Proceedings of the 38th annual international symposium on Computer architecture
BarrierWatch: characterizing multithreaded workloads across and within program-defined epochs
Proceedings of the 8th ACM International Conference on Computing Frontiers
Evaluating placement policies for managing capacity sharing in CMP architectures with private caches
ACM Transactions on Architecture and Code Optimization (TACO)
Evaluation of low-overhead organizations for the directory in future many-core CMPs
Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
A low-latency modular switch for CMP systems
Microprocessors & Microsystems
A cache-partitioning aware replacement policy for chip multiprocessors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
L2-Cache hierarchical organizations for multi-core architectures
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Reducing energy and increasing performance with traffic optimization in many-core systems
Proceedings of the System Level Interconnect Prediction Workshop
Region scheduling: efficiently using the cache architectures via page-level affinity
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
A modular simulator framework for network-on-chip based manycore chips using UNISIM
Transactions on High-Performance Embedded Architectures and Compilers IV
Locality & utility co-optimization for practical capacity management of shared last level caches
Proceedings of the 26th ACM international conference on Supercomputing
Practically private: enabling high performance CMPs through compiler-assisted data classification
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A collaborative memory system for high-performance and cost-effective clustered architectures
Proceedings of the 1st Workshop on Architectures and Systems for Big Data
Concerning with on-chip network features to improve cache coherence protocols for CMPs
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
ACM Transactions on Computer Systems (TOCS)
The locality-aware adaptive cache coherence protocol
Proceedings of the 40th Annual International Symposium on Computer Architecture
Dynamic directories: a mechanism for reducing on-chip interconnect power in multicores
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Jigsaw: scalable software-defined caches
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Exploiting replication to improve performances of NUCA-based CMP systems
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16%for multi-threaded benchmarks and 24%for single-threaded benchmarks, providing better overall performance than either private or shared schemes.