General-purpose graphics processing units (GPUs) offer the potential for high computational performance at reduced cost and power. They are typically employed in heterogeneous settings as accelerators: an application resides on a multi-core host and dispatches work to the GPU. However, workload dispatch is frequently accompanied by large-scale data transfers between the host's main memory and the dedicated memories of the GPUs. For many applications, memory allocation and communication overhead can severely reduce the benefits of GPU acceleration. This paper develops an approach that reduces host-device communication overhead for OpenCL applications. It requires no modification or recompilation of the application source code and is portable across platforms. It works by tracing and analyzing the calls the application makes to the runtime and then selecting the best platform-specific memory allocation and communication policy. The approach was applied to 12 existing OpenCL benchmarks from the Parboil and Rodinia suites on three different platforms, where it gives average speedups of 1.51, 1.31, and 1.48, respectively. In certain cases, it yields up to a threefold improvement over current approaches.
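The idea of tracing runtime calls and then picking a per-platform policy can be illustrated with a small sketch. This is not the paper's implementation: the trace format, the two policy names (`explicit_copy`, `host_mapped`), the linear cost model, and every numeric parameter below are assumptions made purely for illustration. Only the OpenCL call names that appear as strings are real API names.

```python
from collections import defaultdict

# Hypothetical trace records: (runtime call name, buffer id, bytes moved).
TRACE = [
    ("clEnqueueWriteBuffer", "a", 4_000_000),
    ("clEnqueueNDRangeKernel", None, 0),
    ("clEnqueueReadBuffer", "a", 4_000_000),
    ("clEnqueueWriteBuffer", "b", 1_000),
    ("clEnqueueNDRangeKernel", None, 0),
]

# Hypothetical cost parameters for one platform:
# seconds per transferred byte and fixed seconds per transfer call.
PLATFORM = {
    "explicit_copy": {"per_byte": 2e-10, "per_call": 2e-5},
    "host_mapped": {"per_byte": 3e-10, "per_call": 2e-6},
}

TRANSFER_CALLS = {"clEnqueueWriteBuffer", "clEnqueueReadBuffer"}


def summarize(trace):
    """Return {buffer id: (number of transfers, total bytes transferred)}."""
    stats = defaultdict(lambda: [0, 0])
    for call, buf, nbytes in trace:
        if call in TRANSFER_CALLS:
            stats[buf][0] += 1
            stats[buf][1] += nbytes
    return {buf: tuple(counts) for buf, counts in stats.items()}


def choose_policies(trace, platform):
    """Pick, per buffer, the policy with the lowest estimated transfer cost."""
    choices = {}
    for buf, (n_calls, total_bytes) in summarize(trace).items():
        costs = {
            name: p["per_call"] * n_calls + p["per_byte"] * total_bytes
            for name, p in platform.items()
        }
        choices[buf] = min(costs, key=costs.get)
    return choices


if __name__ == "__main__":
    print(choose_policies(TRACE, PLATFORM))
```

Under these assumed parameters, the large buffer is cheaper to copy explicitly while the small one favors the low-latency mapped policy. A different platform table could reverse either choice, which is the reason for selecting the policy per platform.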