Comparative evaluation of memory models for chip multiprocessors

Authors:
Jacob Leverich;Hideho Arakida;Alex Solomatnikov;Amin Firoozshahian;Mark Horowitz;Christos Kozyrakis
Affiliations:
Stanford University;Stanford University;Stanford University;Stanford University;Stanford University;Stanford University
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2008

Citing 34
Cited 3

Integrating message-passing and shared-memory: early experience

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
A comparison of message passing and shared memory architectures for data parallel programs

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Multithreaded programming with Pthreads

Multithreaded programming with Pthreads
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Shared Memory Consistency Models: A Tutorial

Computer
Using the Compiler to Improve Cache Replacement Decisions

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels

Proceedings of the 30th annual international symposium on Computer architecture
TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP

ACM Transactions on Architecture and Code Optimization (TACO)
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Proceedings of the 31st annual international symposium on Computer architecture
Evaluating the Imagine Stream Architecture

Proceedings of the 31st annual international symposium on Computer architecture
Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions

Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling

Proceedings of the 32nd annual international symposium on Computer Architecture
KD-tree acceleration structures for a GPU raytracer

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Maximizing CMP Throughput with Mediocre Cores

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
DRAMsim: a memory system simulator

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Memory hierarchy design for stream computing

Memory hierarchy design for stream computing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Comparing memory systems for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
A Generational Algorithm to Multiprocessor Cache Coherence

ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 01
Executing irregular scientific applications on stream architectures

Proceedings of the 21st annual international conference on Supercomputing
Architectural Support for the Stream Execution Model on General-Purpose Processors

PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
MPEG-2 decoding in a stream programming language

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Using a configurable processor generator for computer architecture prototyping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Memory-, bandwidth-, and power-aware multi-core for a graph database workload

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Hardware-enabled dynamic resource allocation for manycore systems using bidding-based system feedback

EURASIP Journal on Embedded Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

There are two competing models for the on-chip memory in Chip Multiprocessor (CMP) systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications on systems with up to 16 cores, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and nonallocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.