Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Authors:
Ashwin Prasad; Jayvant Anantpur; R. Govindarajan
Affiliations:
Indian Institute of Science, Bangalore, India; Indian Institute of Science, Bangalore, India; Indian Institute of Science, Bangalore, India
Venue:
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Year:
2011


Abstract

MATLAB is an array language that was initially popular for rapid prototyping but is now increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also contain control-flow-dominated scalar regions that affect the program's execution time. Today's computer systems offer tremendous computing power in the form of traditional CPU cores and throughput-oriented accelerators such as graphics processing units (GPUs). Thus, an approach that maps the control-flow-dominated regions to the CPU and the data-parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input to identify data-parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data-parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map the identified kernels to either the CPU or the GPU so that kernel execution on the two devices happens synergistically and the amount of data transfer needed is minimized. To ensure the required data movement for dependencies across basic blocks, we propose a data flow analysis and an edge splitting strategy. Thus, our compiler automatically handles the composition of kernels, the mapping of kernels to the CPU and GPU, scheduling, and the insertion of required data transfers. We implemented the proposed compiler, and an experimental evaluation on a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data-parallel benchmarks over native execution of MATLAB.
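
The mapping step described in the abstract (placing each kernel on the CPU or the GPU while keeping data transfer low) can be illustrated with a small sketch. This is not MEGHA's heuristic, which the abstract does not spell out; it is a simple greedy rule chosen for illustration. The `Kernel` class, the `map_kernels` function, the per-device time estimates, and the flat `transfer_cost` are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """A kernel with hypothetical, pre-estimated run times on each device."""
    name: str
    cpu_time: float
    gpu_time: float
    inputs: tuple = ()  # names of kernels whose results this kernel reads


def map_kernels(kernels, transfer_cost=1.0):
    """Greedily place each kernel on "cpu" or "gpu".

    Kernels are assumed to be listed in dependence order. For each kernel,
    the cost of a device is its estimated run time there plus a flat
    transfer cost for every input that was produced on the other device.
    """
    placement = {}
    for k in kernels:
        cost = {}
        for dev, run_time in (("cpu", k.cpu_time), ("gpu", k.gpu_time)):
            moves = sum(1 for p in k.inputs if placement[p] != dev)
            cost[dev] = run_time + transfer_cost * moves
        placement[k.name] = min(cost, key=cost.get)
    return placement


# Illustrative program: a scalar setup step, two data-parallel steps,
# and a control-flow-heavy step at the end. All numbers are made up.
program = [
    Kernel("setup", cpu_time=1.0, gpu_time=5.0),
    Kernel("matmul", cpu_time=10.0, gpu_time=1.0, inputs=("setup",)),
    Kernel("scale", cpu_time=2.0, gpu_time=1.5, inputs=("matmul",)),
    Kernel("branchy", cpu_time=1.0, gpu_time=8.0, inputs=("scale",)),
]

cheap = map_kernels(program, transfer_cost=1.0)
costly = map_kernels(program, transfer_cost=20.0)
```

With a cheap transfer the two data-parallel kernels land on the GPU and the scalar ones stay on the CPU; with an expensive transfer everything stays on the CPU, which shows why the mapping has to weigh data movement against raw kernel speed.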