Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

Authors:
Daniel Molka; Daniel Hackenberg; Robert Schöne; Matthias S. Müller
Venue:
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Year:
2009

Cited by (34):

The effects of virtualization on main memory systems
Proceedings of the Sixth International Workshop on Data Management on New Hardware

Scalable Graph Exploration on Multicore Processors
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Memory system performance in a NUMA multicore multiprocessor
Proceedings of the 4th Annual International Conference on Systems and Storage

Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead
Proceedings of the international symposium on Memory management

Scalable aggregation on multicore processors
Proceedings of the Seventh International Workshop on Data Management on New Hardware

Reducing Network-on-Chip energy consumption through spatial locality speculation
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip

A study on factors influencing power consumption in multithreaded and multicore CPUs
WSEAS Transactions on Computers

vIOMMU: efficient IOMMU emulation
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference

Simultaneous multithreading on x86_64 systems: an energy efficiency evaluation
HotPower '11 Proceedings of the 4th Workshop on Power-Aware Computing and Systems

Why nothing matters: the impact of zeroing
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

E-AHRW: An Energy-Efficient Adaptive Hash Scheduler for Stream Processing on Multi-core Servers
Proceedings of the 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems

Memory Performance And SPEC OpenMP scalability on quad-socket x86_64 systems
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I

The migration prefetcher: Anticipating data promotion in dynamic NUCA caches
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers

Parallel main-memory indexing for moving-object query and update workloads
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems
Parallel Computing

Efficient frequent item counting in multi-core hardware
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

Reducing last level cache pollution in NUMA multicore systems for improving cache performance
ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part III

Base-delta-immediate compression: practical data compression for on-chip caches
Proceedings of the 21st international conference on Parallel architectures and compilation techniques

BlackjackBench: portable hardware characterization
ACM SIGMETRICS Performance Evaluation Review

Memory performance at reduced CPU clock speeds: an analysis of current x86_64 processors
HotPower'12 Proceedings of the 2012 USENIX conference on Power-Aware Computing and Systems

Characterizing and mitigating work time inflation in task parallel programs
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Graph coloring algorithms for multi-core and massively multithreaded architectures
Parallel Computing

Approximate weighted matching on emerging manycore and multithreaded architectures
International Journal of High Performance Computing Applications

Optimizing tensor contraction expressions for hybrid CPU-GPU execution
Cluster Computing

NUMA-aware shared-memory collective communication for MPI
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing

Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing

Whose cache line is it anyway?: operating system support for live detection and repair of false sharing
Proceedings of the 8th ACM European Conference on Computer Systems

On understanding the energy consumption of ARM-based multicore servers
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems

Understanding parallelism in graph traversal on multi-core clusters
Computer Science - Research and Development

An early performance evaluation of many integrated core architecture based SGI rackable computing system
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

ARI: Adaptive LLC-memory traffic management
ACM Transactions on Architecture and Code Optimization (TACO)

Test-driving Intel Xeon Phi
Proceedings of the 5th ACM/SPEC international conference on Performance engineering

Characterizing and mitigating work time inflation in task parallel programs
Scientific Programming - Selected Papers from Super Computing 2012

Abstract

Today's microprocessors have complex memory subsystems with several cache levels. Efficient use of this memory hierarchy is crucial to achieving optimal performance, especially on multicore processors. Unfortunately, many implementation details of these processors are not publicly available. In this paper, we present such fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, QuickPath Interconnect, and ccNUMA architecture. Our analysis is based on sophisticated benchmarks that measure the latency and bandwidth between different locations in the memory subsystem. Special care is taken to control the coherency state of the data in order to gain insight into performance-relevant implementation details of the cache coherency protocol. Based on these benchmarks, we present undocumented performance data and architectural properties.