Heterogeneous system coherence for integrated CPU-GPU systems

Authors:
Jason Power;Arkaprava Basu;Junli Gu;Sooraj Puthoor;Bradford M. Beckmann;Mark D. Hill;Steven K. Reinhardt;David A. Wood
Affiliations:
University of Wisconsin -- Madison;University of Wisconsin -- Madison;Advanced Micro Devices, Inc.;Advanced Micro Devices, Inc.;Advanced Micro Devices, Inc.;University of Wisconsin -- Madison and Advanced Micro Devices, Inc.;Advanced Micro Devices, Inc.;University of Wisconsin -- Madison and Advanced Micro Devices, Inc.
Venue:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2013

Citing 19
Cited 0

An economical solution to the cache coherence problem

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Proceedings of the 32nd annual international symposium on Computer Architecture
Scalable Cache Miss Handling for High Memory-Level Parallelism

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Die Stacking (3D) Microarchitecture

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Improving the accuracy of snoop filtering using stream registers

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
An asymmetric distributed shared memory model for heterogeneous parallel systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
TurboTag: lookup filtering to reduce coherence directory power

Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design
Subspace snooping: filtering snoops with operating system support

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
The gem5 simulator

ACM SIGARCH Computer Architecture News
PTask: operating system abstractions to manage GPUs as compute devices

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Why on-chip cache coherence is here to stay

Communications of the ACM
Spatiotemporal Coherence Tracking

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Cache coherence for GPU architectures

HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many future heterogeneous systems will integrate CPUs and GPUs physically on a single chip and logically connect them via shared memory to avoid explicit data copying. Making this shared memory coherent facilitates programming and fine-grained sharing, but throughput-oriented GPUs can overwhelm CPUs with coherence requests not well-filtered by caches. Meanwhile, region coherence has been proposed for CPU-only systems to reduce snoop bandwidth by obtaining coherence permissions for large regions. This paper develops Heterogeneous System Coherence (HSC) for CPU-GPU systems to mitigate the coherence bandwidth effects of GPU memory requests. HSC replaces a standard directory with a region directory and adds a region buffer to the L2 cache. These structures allow the system to move bandwidth from the coherence network to the high-bandwidth direct-access bus without sacrificing coherence. Evaluation results with a subset of Rodinia benchmarks and the AMD APP SDK show that HSC can improve performance compared to a conventional directory protocol by an average of more than 2x and a maximum of more than 4.5x. Additionally, HSC reduces the bandwidth to the directory by an average of 94% and by more than 99% for four of the analyzed benchmarks.