Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor

Authors:
Pat Conway;Nathan Kalyanasundharam;Gregg Donley;Kevin Lepak;Bill Hughes
Affiliations:
Advanced Micro Devices;Advanced Micro Devices;Advanced Micro Devices;Advanced Micro Devices;Advanced Micro Devices
Venue:
IEEE Micro
Year:
2010

Citing 0
Cited 32

Garbage collection for multicore NUMA machines
Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness

How soccer players would do stream joins
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Two-hop free-space based optical interconnects for chip multiprocessors
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip

Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Proceedings of the 38th annual international symposium on Computer architecture

Why nothing matters: the impact of zeroing
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications

Memory Performance And SPEC OpenMP scalability on quad-socket x86 64 systems
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I

ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers

Building a scalable and portable message-passing library for embedded multicore systems
Proceedings of the 2011 ACM Symposium on Research in Applied Computation

An efficient software shared virtual memory for the single-chip cloud computer
Proceedings of the Second Asia-Pacific Workshop on Systems

Switch-based packing technique to reduce traffic and latency in token coherence
Journal of Parallel and Distributed Computing

Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Efficiently enabling conventional block sizes for very large die-stacked DRAM caches
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Why on-chip cache coherence is here to stay
Communications of the ACM

Improving coherence protocol reactiveness by trading bandwidth for latency
Proceedings of the 9th conference on Computing Frontiers

Virtualization of reconfigurable coprocessors in HPRC systems with multicore architecture
Journal of Systems Architecture: the EUROMICRO Journal

Node-based memory management for scalable NUMA architectures
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers

Characterizing and Understanding PDES Behavior on Tilera Architecture
PADS '12 Proceedings of the 2012 ACM/IEEE/SCS 26th Workshop on Principles of Advanced and Distributed Simulation

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Spatiotemporal Coherence Tracking
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

The locality-aware adaptive cache coherence protocol
Proceedings of the 40th Annual International Symposium on Computer Architecture

Interference resilient PDES on multi-core systems: towards proportional slowdown
Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation

LP-NUCA: networks-in-cache for high-performance low-power embedded processors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Scaling LAPACK panel operations using parallel cache assignment
ACM Transactions on Mathematical Software (TOMS)

Ordering circuit establishment in multiplane NoCs
ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies

Design and performance evaluation of NUMA-aware RDMA-based end-to-end data transfer systems
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles

Everything you always wanted to know about synchronization but were afraid to ask
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

The case for a scalable coherence protocol for complex on-chip cache hierarchies in many core systems
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Building expressive, area-efficient coherence directories
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

A generic high-performance method for deinterleaving scientific data
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Throughput-memory footprint trade-off in synthesis of streaming software on embedded multiprocessors
ACM Transactions on Embedded Computing Systems (TECS)

PAIS: Parallelism-aware interconnect scheduling in multicores
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

Quantified Score

Hi-index	0.02

Abstract

The 12-core AMD Opteron processor, code-named "Magny Cours," combines advances in silicon, packaging, interconnect, cache coherence protocol, and server architecture to increase the compute density of high-volume commodity 2P/4P blade servers while operating within the same power envelope as earlier-generation AMD Opteron processors. A key enabling feature, the probe filter, reduces both memory latency and the bandwidth overhead of traditional broadcast-based coherence.
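
The bandwidth claim can be illustrated with a toy model. The sketch below is not the Magny Cours protocol: it assumes a hypothetical directory that tracks a single holder per cache line, treats every access as a miss, and counts one probe per node contacted. The names `ToyProbeFilter` and `count_probes` are invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProbeFilter:
    """Toy directory: maps a cache-line address to the one node assumed to hold it."""
    num_nodes: int
    owner: dict = field(default_factory=dict)  # line address -> holder node id

    def probes_broadcast(self) -> int:
        # Assumption: without a filter, every node other than the requester is probed.
        return self.num_nodes - 1

    def probes_filtered(self, requester: int, line: int) -> int:
        holder = self.owner.get(line)
        # A directed probe is needed only if some other node holds the line.
        probes = 0 if holder is None or holder == requester else 1
        self.owner[line] = requester  # the requester becomes the tracked holder
        return probes


def count_probes(num_nodes, accesses):
    """Return (broadcast_probes, filtered_probes) for a list of (node, line) misses."""
    pf = ToyProbeFilter(num_nodes)
    broadcast = filtered = 0
    for requester, line in accesses:
        broadcast += pf.probes_broadcast()
        filtered += pf.probes_filtered(requester, line)
    return broadcast, filtered


# Four nodes; only the third access touches a line held by another node.
accesses = [(0, 0x40), (1, 0x80), (2, 0x40), (2, 0xC0)]
print(count_probes(4, accesses))  # (12, 1)
```

In this simplified model, misses to lines that no other node caches generate no probes at all, which is where both the bandwidth saving and the latency saving come from; a real probe filter also has to handle shared lines and limited directory capacity, which this sketch ignores.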