Semi-automatic restructuring of offloadable tasks for many-core accelerators

Authors:
Nishkam Ravi;Yi Yang;Tao Bao;Srimat Chakradhar
Affiliations:
NEC Labs America, Princeton, NJ;NEC Labs America, Princeton, NJ;Purdue University;NEC Labs America, Princeton, NJ
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Citing 35
Cited 0

Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Communication-Minimal Partitioning of Parallel Loops and Data Arrays for Cache-Coherent Distributed-Memory Multiprocessors

LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
A Static Scheduling Heuristic for Heterogeneous Processors

Euro-Par '96 Proceedings of the Second International Euro-Par Conference on Parallel Processing-Volume II
Mapping heterogeneous task graphs onto heterogeneous system graphs

HCW '97 Proceedings of the 6th Heterogeneous Computing Workshop (HCW '97)
Dynamic thread assignment on heterogeneous multiprocessor architectures

Proceedings of the 3rd conference on Computing frontiers
X10: concurrent programming for modern architectures

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallel Programmability and the Chapel Language

International Journal of High Performance Computing Applications
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Scheduling Parallel Task Graphs on (Almost) Homogeneous Multicluster Platforms

IEEE Transactions on Parallel and Distributed Systems
Automatic Offloading of C++ for the Cell BE Processor: A Case Study Using Offload

CISIS '10 Proceedings of the 2010 International Conference on Complex, Intelligent and Software Intensive Systems
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
An OpenCL framework for heterogeneous multicores with local memory

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Achieving a single compute device image in OpenCL for multiple GPUs

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

Concurrency and Computation: Practice & Experience - Euro-Par 2009
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Induction variable analysis with delayed abstractions

HiPEAC'05 Proceedings of the First international conference on High Performance Embedded Architectures and Compilers
A performance analysis framework for identifying potential benefits in GPGPU applications

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Mapping a data-flow programming model onto heterogeneous platforms

Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Proceedings of the 26th ACM international conference on Supercomputing
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters

Proceedings of the 26th ACM international conference on Supercomputing
Productive Programming of GPU Clusters with OmpSs

IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
Early evaluation of directive-based GPU programming models for productive exascale computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Tiling stencil computations to maximize parallelism

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Designing a unified programming model for heterogeneous machines

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Portable performance on heterogeneous architectures

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Work division between the processor and accelerator is a common theme in modern heterogenous computing. Recent efforts (such as LEO and OpenAcc) provide directives that allow the developer to mark code regions in the original application from which offloadable tasks can be generated by the compiler. Auto-tuners and runtime schedulers work with the options (i.e., offloadable tasks) generated at compile time, which is limited by the directives specified by the developer. There is no provision for offload restructuring. We propose a new directive to add relaxed semantics to directive-based languages. The compiler identifies and generates one or more offloadable tasks in the neighbourhood of the code region marked by the directive. Central to our contribution is the idea of sub-offload and super-offload. In sub-offload, only a part of the code region marked by the developer is offloaded to the accelerator, while the other part executes on the CPU in parallel. This is done by splitting the index range of the main parallel loop into two or more parts and declaring one of the subloops as the offloadable task. Support is added to handle reduction variables and critical sections across subloops. Sub-offload enables concurrent execution of a task on the CPU and accelerator. In super-offload, a code region larger than the one specified by the developer is declared as the offloadable task (e.g., a parent loop). Super-offload reduces data transfers between CPU and accelerator memory. We develop Elastic Offload Compiler(EOC) for use alongside existing directive-based languages. The current implementation supports LEO for the new Intel Xeon Phi (MIC) architecture. We evaluate EOC with respect to SpecOMP and NAS Parallel Benchmarks. Speedups range between 1.3x-4.4x with the CPU version as baseline and 1.2x-24x with the offload (CPU-MIC) version as baseline.