Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Authors:
Sreepathi Pai;R. Govindarajan;Matthew J. Thazhuthaveetil
Affiliations:
Indian Institute of Science, Bangalore, India;Indian Institute of Science, Bangalore, India;Indian Institute of Science, Bangalore, India
Venue:
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Year:
2012

Abstract

Exploiting the performance potential of GPUs requires managing data transfers to and from them efficiently, which is an error-prone and tedious task. In this paper, we develop a software coherence mechanism that fully automates all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale accesses and a runtime to initiate transfers as necessary. This allows us to avoid the redundant transfers exhibited by all other existing automatic memory management proposals. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only and compiler-only proposals, it also transfers 2.2x to 13.3x less data on average.
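
The general idea of checking for staleness before an access and transferring only when needed can be illustrated with a minimal sketch. This is not the paper's implementation or the X10 runtime API: the class `CoherentBuffer`, its methods, and the helper functions are hypothetical names chosen for illustration, and the "GPU" is simulated with an ordinary Python list. The sketch assumes that a compiler would insert the `ensure_on_*` calls before every access it cannot prove to be fresh, and that the runtime decides at that point whether a copy is actually required.

```python
class CoherentBuffer:
    """Hypothetical buffer holding a CPU copy and a simulated GPU copy."""

    def __init__(self, data):
        self.cpu = list(data)
        self.gpu = [0] * len(self.cpu)
        self.cpu_stale = False
        self.gpu_stale = True   # the GPU copy has never been filled
        self.transfers = 0      # number of copies actually performed

    def ensure_on_gpu(self):
        # Check assumed to be inserted before a potentially stale GPU access.
        if self.gpu_stale:
            self.gpu = list(self.cpu)
            self.gpu_stale = False
            self.transfers += 1

    def ensure_on_cpu(self):
        # Check assumed to be inserted before a potentially stale CPU access.
        if self.cpu_stale:
            self.cpu = list(self.gpu)
            self.cpu_stale = False
            self.transfers += 1

    def gpu_wrote(self):
        # A GPU write makes the CPU copy stale.
        self.cpu_stale = True

    def cpu_wrote(self):
        # A CPU write makes the GPU copy stale.
        self.gpu_stale = True


def launch_scale_kernel(buf, factor):
    """Stand-in for a GPU kernel that scales every element in place."""
    buf.ensure_on_gpu()
    buf.gpu = [x * factor for x in buf.gpu]
    buf.gpu_wrote()


def cpu_read(buf, index):
    """CPU-side read guarded by a coherence check."""
    buf.ensure_on_cpu()
    return buf.cpu[index]


buf = CoherentBuffer([1, 2, 3])
launch_scale_kernel(buf, 2)   # copies CPU -> GPU once
launch_scale_kernel(buf, 2)   # GPU copy is fresh, so no copy is made
result = cpu_read(buf, 2)     # copies GPU -> CPU once
again = cpu_read(buf, 0)      # CPU copy is fresh, so no copy is made
```

In this toy run only two copies take place across four guarded accesses. A scheme that copied conservatively around every kernel launch would perform more, which is the kind of redundancy the abstract refers to.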