Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Parallel Programmability and the Chapel Language
International Journal of High Performance Computing Applications
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming model for a heterogeneous x86 platform
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
An asymmetric distributed shared memory model for heterogeneous parallel systems
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Automatic CPU-GPU communication management and optimization
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Dynamically managed data for CPU-GPU architectures
Proceedings of the Tenth International Symposium on Code Generation and Optimization
G-Charm: an adaptive runtime system for message-driven parallel applications on hybrid systems
Proceedings of the 27th international ACM conference on International conference on supercomputing
Design and implementation of the fusion simulator based on multi-shader GPU
Proceedings of the 2013 Research in Adaptive and Convergent Systems
INSPIRE: the insieme parallel intermediate representation
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
RSVM: a region-based software virtual memory for GPU
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Generating efficient data movement code for heterogeneous architectures with distributed-memory
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Automatic data allocation and buffer management for multi-GPU machines
ACM Transactions on Architecture and Code Optimization (TACO)
Design patterns for sparse-matrix computations on hybrid CPU/GPU platforms
Scientific Programming
Hi-index | 0.00 |
Exploiting the performance potential of GPUs requires managing the data transfers to and from them efficiently which is an error-prone and tedious task. In this paper, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale accesses and uses a runtime to initiate transfers as necessary. This allows us to avoid redundant transfers that are exhibited by all other existing automatic memory management proposals. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler--runtime automatic memory management system. Compared to other existing runtime-only and compiler-only proposals, it also transfers 2.2x to 13.3x less data on average.