Reactive NUMA: a design for unifying S-COMA and CC-NUMA
Proceedings of the 24th annual international symposium on Computer architecture
Memory system characterization of commercial workloads
Proceedings of the 25th annual international symposium on Computer architecture
Performance of database workloads on shared-memory systems with out-of-order processors
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Reducing cache misses using hardware and software page placement
ICS '99 Proceedings of the 13th international conference on Supercomputing
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Route packets, not wires: on-chip inteconnection networks
Proceedings of the 38th annual Design Automation Conference
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA
HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Token coherence: decoupling performance and correctness
Proceedings of the 30th annual international symposium on Computer architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Managing Wire Delay in Large Chip-Multiprocessor Caches
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Memory coherence activity prediction in commercial workloads
WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors
Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs
Proceedings of the 32nd annual international symposium on Computer Architecture
Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Fast and fair: data-stream quality of service
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
A NUCA substrate for flexible CMP cache sharing
Proceedings of the 19th annual international conference on Supercomputing
Maximizing CMP Throughput with Mediocre Cores
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Store Memory-Level Parallelism Optimizations for Commercial Applications
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
A flexible data to L2 cache mapping approach for future multicore processors
Proceedings of the 2006 workshop on Memory system performance and correctness
ASR: Adaptive Selective Replication for CMP Caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Virtual hierarchies to support server consolidation
Proceedings of the 34th annual international symposium on Computer architecture
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
BulkSC: bulk enforcement of sequential consistency
Proceedings of the 34th annual international symposium on Computer architecture
An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Utilizing shared data in chip multiprocessors with the Nahalal architecture
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
SP-NUCA: a cost effective dynamic non-uniform cache architecture
ACM SIGARCH Computer Architecture News
Towards hybrid last level caches for chip-multiprocessors
ACM SIGARCH Computer Architecture News
A novel migration-based NUCA design for chip multiprocessors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Micro-pages: increasing DRAM efficiency with locality-aware data placement
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Compiler-based data classification for hybrid caching
Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
The auction: optimizing banks usage in Non-Uniform Cache Architectures
Proceedings of the 24th ACM International Conference on Supercomputing
Proceedings of the 37th annual international symposium on Computer architecture
Cohesion: a hybrid memory model for accelerators
Proceedings of the 37th annual international symposium on Computer architecture
A data placement strategy in scientific cloud workflows
Future Generation Computer Systems
Subspace snooping: filtering snoops with operating system support
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Handling the problems and opportunities posed by multiple on-chip memory controllers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SWEL: hardware cache coherence protocols to map shared data onto shared caches
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Compiler-assisted data distribution for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
NoC-aware cache design for chip multiprocessors
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Data-oriented transaction execution
Proceedings of the VLDB Endowment
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors
Journal of Parallel and Distributed Computing
Brief announcement: distributed shared memory based on computation migration
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Proceedings of the 38th annual international symposium on Computer architecture
The impact of memory subsystem resource sharing on datacenter applications
Proceedings of the 38th annual international symposium on Computer architecture
PLP: page latch-free shared-everything OLTP
Proceedings of the VLDB Endowment
DAPSCO: Distance-aware partially shared cache organization
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
The migration prefetcher: Anticipating data promotion in dynamic NUCA caches
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Reducing energy and increasing performance with traffic optimization in many-core systems
Proceedings of the System Level Interconnect Prediction Workshop
Clearing the clouds: a study of emerging scale-out workloads on modern hardware
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A data layout optimization framework for NUCA-based multicores
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Memory management for many-core processors with software configurable locality policies
Proceedings of the 2012 international symposium on Memory Management
An automatic code overlaying technique for multicores with explicitly-managed memory hierarchies
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Locality & utility co-optimization for practical capacity management of shared last level caches
Proceedings of the 26th ACM international conference on Supercomputing
Boosting mobile GPU performance with a decoupled access/execute fragment processor
Proceedings of the 39th Annual International Symposium on Computer Architecture
Proceedings of the 39th Annual International Symposium on Computer Architecture
End-to-end sequential consistency
Proceedings of the 39th Annual International Symposium on Computer Architecture
Proceedings of the VLDB Endowment
Practically private: enabling high performance CMPs through compiler-assisted data classification
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Complexity-effective multicore coherence
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Survey of scheduling techniques for addressing shared resources in multicore processors
ACM Computing Surveys (CSUR)
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors
ACM Transactions on Computer Systems (TOCS)
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Stream arbitration: Towards efficient bandwidth utilization for emerging on-chip interconnects
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
ACM Transactions on Computer Systems (TOCS)
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
NOC-Out: Microarchitecting a Scale-Out Processor
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Addressing End-to-End Memory Access Latency in NoC-Based Multicores
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Spatiotemporal Coherence Tracking
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Low-Latency Mechanisms for Near-Threshold Operation of Private Caches in Shared Memory Multicores
MICROW '12 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture Workshops
Scalable and dynamically balanced shared-everything OLTP with physiological partitioning
The VLDB Journal — The International Journal on Very Large Data Bases
Replacement techniques for dynamic NUCA cache designs on CMPs
The Journal of Supercomputing
A survey on cache tuning from a power/energy perspective
ACM Computing Surveys (CSUR)
Proceedings of the 40th Annual International Symposium on Computer Architecture
The locality-aware adaptive cache coherence protocol
Proceedings of the 40th Annual International Symposium on Computer Architecture
A new perspective for efficient virtual-cache coherence
Proceedings of the 40th Annual International Symposium on Computer Architecture
Non-race concurrency bug detection through order-sensitive critical sections
Proceedings of the 40th Annual International Symposium on Computer Architecture
LP-NUCA: networks-in-cache for high-performance low-power embedded processors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Dynamic directories: a mechanism for reducing on-chip interconnect power in multicores
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Jigsaw: scalable software-defined caches
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Towards efficient dynamic LLC home bank mapping with noc-level support
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Multi-grain coherence directories
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Ubik: efficient cache sharing with strict qos for latency-critical workloads
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Exploiting replication to improve performances of NUCA-based CMP systems
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
DP&TB: a coherence filtering protocol for many-core chip multiprocessors
The Journal of Supercomputing
Hi-index | 0.00 |
Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.