Comparing memory systems for chip multiprocessors

Authors:
Jacob Leverich;Hideho Arakida;Alex Solomatnikov;Amin Firoozshahian;Mark Horowitz;Christos Kozyrakis
Affiliations:
Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA;Stanford University, Stanford, CA
Venue:
Proceedings of the 34th annual international symposium on Computer architecture
Year:
2007

Citing 30
Cited 21

A comparison of message passing and shared memory architectures for data parallel programs

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Multithreaded programming with Pthreads

Multithreaded programming with Pthreads
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Blocking and array contraction across arbitrarily nested loops using affine partitioning

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Parallel Computer Architecture: A Hardware/Software Approach

Parallel Computer Architecture: A Hardware/Software Approach
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Shared Memory Consistency Models: A Tutorial

Computer
Using the Compiler to Improve Cache Replacement Decisions

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels

Proceedings of the 30th annual international symposium on Computer architecture
TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP

ACM Transactions on Architecture and Code Optimization (TACO)
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Proceedings of the 31st annual international symposium on Computer architecture
Evaluating the Imagine Stream Architecture

Proceedings of the 31st annual international symposium on Computer architecture
Exploring Energy/Performance Tradeoffs in Shared Memory MPSoCs: Snoop-Based Cache Coherence vs. Software Solutions

Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling

Proceedings of the 32nd annual international symposium on Computer Architecture
KD-tree acceleration structures for a GPU raytracer

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Maximizing CMP Throughput with Mediocre Cores

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
DRAMsim: a memory system simulator

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Memory hierarchy design for stream computing

Memory hierarchy design for stream computing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
GPU Computing: Programming a Massively Parallel Processor

Proceedings of the International Symposium on Code Generation and Optimization
A Generational Algorithm to Multiprocessor Cache Coherence

ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 01
MPEG-2 decoding in a stream programming language

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Comparison of memory write policies for NoC based multicore cache coherent systems

Proceedings of the conference on Design, automation and test in Europe
Comparative evaluation of memory models for chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
L1 Collective Cache: Managing Shared Data for Chip Multiprocessors

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
A design methodology for domain-optimized power-efficient supercomputing

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Proceedings of the 7th ACM international conference on Computing frontiers
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
Hardware/software support for adaptive work-stealing in on-chip multiprocessor

Journal of Systems Architecture: the EUROMICRO Journal
A software-SVM-based transactional memory for multicore accelerator architectures with local memory

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Accelerating data movement on future chip multi-processors

Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies
An instruction-scheduling-aware data partitioning technique for coarse-grained reconfigurable architectures

Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Energy-efficient cache coherence protocol for NoC-based MPSoCs

Proceedings of the 24th symposium on Integrated circuits and systems design
Evaluation of low-overhead organizations for the directory in future many-core CMPs

Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
Hardware/software co-design for energy-efficient seismic modeling

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

The Journal of Supercomputing
Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream Processors

ACM Transactions on Architecture and Code Optimization (TACO)
Remote store programming: a memory model for embedded multicore

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Hardware-software coherence protocol for the coexistence of caches and local memories

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
NP-SARC: Scalable network processing in the SARC multi-core FPGA platform

Journal of Systems Architecture: the EUROMICRO Journal
Location-aware cache management for many-core processors with deep cache hierarchy

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

Quantified Score

Hi-index	0.00

Visualization

Abstract

There are two basic models for the on-chip memory in CMP systems:hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two modelsunder the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.