Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Authors:
Nishkam Ravi;Yi Yang;Tao Bao;Srimat Chakradhar
Affiliations:
NEC Laboratories, Princeton, NJ, USA;North Carolina State University, Raleigh, NC, USA;Purdue University, West Lafayette, IN, USA;NEC Laboratories, Princeton, NJ, USA
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012

Abstract

Intel MIC (Many Integrated Core) is the first x86-based coprocessor architecture aimed at accelerating multi-core HPC applications. In the most common usage model, parallel code sections are offloaded to the MIC coprocessor using LEO (Language Extensions for Offload). The developer is responsible for identifying and specifying offloadable code regions, managing data transfers between the CPU and the MIC, and optimizing the application for performance, all of which requires effort and experimentation. In this paper, we present Apricot, an optimizing compiler and productivity tool for x86-compatible many-core coprocessors (such as Intel MIC) that minimizes developer effort by (i) automatically inserting LEO clauses for parallelizable code regions, (ii) selectively offloading some of these code regions to the coprocessor at runtime based on a cost model that we have developed, and (iii) applying a set of optimizations that minimize data communication overhead and improve overall performance. Apricot is intended to assist programmers in porting existing multi-core applications, and in writing new ones, to take advantage of the many-core coprocessor while maximizing overall performance. Experiments with SpecOMP and NAS Parallel benchmarks show that Apricot can successfully transform OpenMP applications to run on the MIC coprocessor with good performance gains.
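
To make points (i) and (ii) concrete, the following is a minimal sketch in C of what an OpenMP loop might look like once a LEO offload clause and a runtime offload decision have been added. It is an illustration under stated assumptions, not Apricot's actual output or cost model: the function names should_offload and scaled_add, the three machine-parameter constants, and the time-estimate formula are all hypothetical. The offload pragma is guarded by __INTEL_OFFLOAD so that the code still builds, and runs on the host only, with compilers that do not support LEO.

```c
#include <stddef.h>

/* Hypothetical machine parameters, chosen only for illustration.
   They are not measurements and are not taken from the paper. */
#define CPU_FLOPS_PER_SEC  1.0e11
#define MIC_FLOPS_PER_SEC  5.0e11
#define PCIE_BYTES_PER_SEC 6.0e9

/* Hypothetical cost model: offload only when the estimated coprocessor
   time (compute plus data transfer) is below the estimated host time. */
int should_offload(size_t iters, double flops_per_iter, size_t bytes_moved)
{
    double work      = (double)iters * flops_per_iter;
    double host_time = work / CPU_FLOPS_PER_SEC;
    double mic_time  = work / MIC_FLOPS_PER_SEC
                     + (double)bytes_moved / PCIE_BYTES_PER_SEC;
    return mic_time < host_time;
}

/* Computes b[i] += scale * a[i] for i in [0, n). */
void scaled_add(const float *a, float *b, size_t n, float scale)
{
    /* 2 flops per iteration; a is copied in, b is copied in and out. */
    int use_mic = should_offload(n, 2.0, 3 * n * sizeof(float));
    (void)use_mic; /* unused when the offload pragma is compiled out */

#ifdef __INTEL_OFFLOAD
    /* LEO clause: run the next statement on the coprocessor only if
       use_mic is nonzero, transferring a and b as declared. */
#pragma offload target(mic) if(use_mic) in(a : length(n)) inout(b : length(n))
#endif
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (size_t i = 0; i < n; ++i)
        b[i] += scale * a[i];
}
```

With these illustrative constants, a short loop such as should_offload(1000, 2.0, 12000) stays on the host because the transfer time dominates, while a compute-heavy region such as should_offload(1000000, 1000.0, 12000000) is offloaded. The abstract describes a runtime decision of this general kind, though the real cost model may be quite different.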