Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Authors:
Bharath Pichai;Lisa Hsu;Abhishek Bhattacharjee
Affiliations:
Rutgers University, Piscataway, NJ, USA;Qualcomm Research, Raleigh, NC, USA;Rutgers University, Piscataway, NJ, USA
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Citing 39
Cited 0

Surpassing the TLB performance of superpages with less operating system support

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Performance of the VAX-11/780 translation buffer: simulation and measurement

ACM Transactions on Computer Systems (TOCS)
A look at several memory management units, TLB-refill mechanisms, and page table organizations

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Efficient conditional operations for data-parallel architectures

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Going the distance for TLB prefetching: an application-driven study

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
U-cache: a cost-effective solution to synonym problem

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
The Vector-Thread Architecture

Proceedings of the 31st annual international symposium on Computer architecture
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Practical, transparent operating system support for superpages

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementationCopyright restrictions prevent ACM from being able to make the PDFs for this conference available for downloading
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Larrabee: a many-core x86 architecture for visual computing

ACM SIGGRAPH 2008 papers
Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Proceedings of the 36th annual international symposium on Computer architecture
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
An asymmetric distributed shared memory model for heterogeneous parallel systems

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Inter-core cooperative TLB for chip multiprocessors

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Translation caching: skip, don't walk (the page table)

Proceedings of the 37th annual international symposium on Computer architecture
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
SpecTLB: a mechanism for speculative address translation

Proceedings of the 38th annual international symposium on Computer architecture
Thread block compaction for efficient SIMT control flow

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Shared last-level TLBs for chip multiprocessors

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Improving GPU performance via large warps and two-level warp scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
IOMMU: strategies for mitigating the IOTLB bottleneck

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Supporting virtual memory in GPGPU without supporting precise exceptions

Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Dynamically managed data for CPU-GPU architectures

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems

ISPASS '12 Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software
iGPU: exception support and speculative execution on GPUs

Proceedings of the 39th Annual International Symposium on Computer Architecture
Reducing memory reference energy with opportunistic virtual caching

Proceedings of the 39th Annual International Symposium on Computer Architecture
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Cache-Conscious Wavefront Scheduling

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
CoLT: Coalesced Large-Reach TLBs

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Convolution engine: balancing efficiency & flexibility in specialized computing

Proceedings of the 40th Annual International Symposium on Computer Architecture
Thin servers with smart pipes: designing SoC accelerators for memcached

Proceedings of the 40th Annual International Symposium on Computer Architecture
Triggered instructions: a control paradigm for spatially-programmed architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
Efficient virtual memory for big memory servers

Proceedings of the 40th Annual International Symposium on Computer Architecture
Navigating big data with high-throughput, energy-efficient data partitioning

Proceedings of the 40th Annual International Symposium on Computer Architecture
A new perspective for efficient virtual-cache coherence

Proceedings of the 40th Annual International Symposium on Computer Architecture
Cache coherence for GPU architectures

HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)

Quantified Score

Hi-index	0.00

Visualization

Abstract

The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory. To this end, we are the first to explore GPU Memory Management Units(MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems. We show the performance challenges posed by GPU warp schedulers on TLBs accessed in parallel with L1 caches, which provide many well-known programmability benefits. In response, we propose modest TLB and PTW augmentations that recover most of the performance lost by introducing L1 parallel TLB access. We also show that a little TLB-awareness can make other GPU performance enhancements (e.g., cache-conscious warp scheduling and dynamic warp formation on branch divergence) feasible in the face of cache-parallel address translation, bringing overheads in the range deemed acceptable for CPUs (10-15\% of runtime). We presume this initial design leaves room for improvement but anticipate that our bigger insight, that a little TLB-awareness goes a long way in GPUs, will spur further work in this fruitful area.