Work division between the processor and accelerator is a common theme in modern heterogeneous computing. Recent efforts (such as LEO and OpenACC) provide directives that allow the developer to mark code regions in the original application from which the compiler can generate offloadable tasks. Auto-tuners and runtime schedulers work with the options (i.e., offloadable tasks) generated at compile time, and these options are limited by the directives the developer specified. There is no provision for offload restructuring. We propose a new directive that adds relaxed semantics to directive-based languages. The compiler identifies and generates one or more offloadable tasks in the neighbourhood of the code region marked by the directive. Central to our contribution is the idea of sub-offload and super-offload. In sub-offload, only a part of the code region marked by the developer is offloaded to the accelerator, while the other part executes on the CPU in parallel. This is done by splitting the index range of the main parallel loop into two or more parts and declaring one of the subloops as the offloadable task. Support is added to handle reduction variables and critical sections across subloops. Sub-offload enables concurrent execution of a task on the CPU and accelerator. In super-offload, a code region larger than the one specified by the developer (e.g., a parent loop) is declared as the offloadable task. Super-offload reduces data transfers between CPU and accelerator memory. We develop the Elastic Offload Compiler (EOC) for use alongside existing directive-based languages. The current implementation supports LEO for the new Intel Xeon Phi (MIC) architecture. We evaluate EOC on the SPEC OMP and NAS Parallel Benchmarks. Speedups range from 1.3x to 4.4x with the CPU version as baseline and from 1.2x to 24x with the offload (CPU-MIC) version as baseline.
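The index-range splitting behind sub-offload can be illustrated with a minimal sketch. This is not EOC's generated code and does not use LEO: two Python threads stand in for the accelerator and the CPU, and the names `split_range`, `partial_sum`, `sub_offload_sum`, and the `accel_fraction` parameter are hypothetical. It shows only how a parallel loop with a sum reduction might be cut into two subloops whose private partial results are combined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, accel_fraction):
    """Split [start, stop) into an accelerator subrange and a CPU subrange.

    accel_fraction is a hypothetical tuning knob: the share of iterations
    handed to the accelerator subloop.
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must lie in [0, 1]")
    cut = start + int((stop - start) * accel_fraction)
    return (start, cut), (cut, stop)


def partial_sum(data, lo, hi):
    """One subloop: reduce over its own index range into a private total."""
    total = 0
    for i in range(lo, hi):
        total += data[i] * data[i]
    return total


def sub_offload_sum(data, accel_fraction=0.75):
    """Run both subloops concurrently and combine their reduction results.

    The first worker is a stand-in for the offloaded subloop; in a real
    system it would execute on the accelerator rather than in a thread.
    """
    accel_rng, cpu_rng = split_range(0, len(data), accel_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        accel = pool.submit(partial_sum, data, *accel_rng)
        cpu = pool.submit(partial_sum, data, *cpu_rng)
        # Each subloop kept a private copy of the reduction variable;
        # the copies are merged only after both subloops finish.
        return accel.result() + cpu.result()
```

Giving each subloop a private reduction variable and merging at the end avoids sharing one accumulator between devices; in this sketch the split point is fixed by `accel_fraction`, whereas a real system would presumably choose it from the relative speed of the two devices.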