The initial wave of programming models for general-purpose computing on GPUs, led by CUDA and OpenCL, has provided experts with low-level constructs to obtain significant performance and energy improvements on GPUs. However, these programming models are characterized by a challenging learning curve for non-experts due to their complex and low-level APIs. Looking to the future, improving the accessibility of GPUs and accelerators for mainstream software developers is crucial to bringing the benefits of these heterogeneous architectures to a broader set of application domains.

A key challenge in doing so is that mainstream developers are accustomed to working with high-level managed languages, such as Java, rather than lower-level native languages such as C, CUDA, and OpenCL. The OpenCL standard enables portable execution of SIMD kernels across a wide range of platforms, including multi-core CPUs, many-core GPUs, and FPGAs. However, using OpenCL from Java to program multi-architecture systems is difficult and error-prone. Programmers are required to explicitly perform a number of low-level operations, such as (1) managing data transfers between the host system and the GPU, (2) writing kernels in the OpenCL kernel language, (3) compiling kernels and performing other OpenCL initialization, and (4) using the Java Native Interface (JNI) to access the C/C++ APIs for OpenCL.

In this paper, we present compile-time and run-time techniques for accelerating programs written in Java using automatic generation of OpenCL as a foundation. Our approach includes (1) automatic generation of OpenCL kernels and JNI glue code from a parallel-for construct (forall) available in the Habanero-Java (HJ) language, (2) leveraging HJ's array view language construct to efficiently support rectangular, multi-dimensional arrays on OpenCL devices, and (3) implementing HJ's phaser (next) construct for all-to-all barrier synchronization in automatically generated OpenCL kernels.
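To illustrate items (1) and (2) above, the sketch below is plain Java, not HJ syntax: a rectangular 2D array view backed by a flat row-major array (the layout that makes transfers to OpenCL devices efficient), with the forall-style row loop approximated by a parallel stream. The class name, field names, and fill formula are illustrative assumptions, not part of the HJ API.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: a rectangular 2D "array view" over a flat
// row-major backing array, as an analogy for HJ's array views.
public class ArrayView2D {
    final double[] data;          // flat backing store, contiguous for device transfer
    final int rows, cols;

    ArrayView2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    // Row-major index mapping: element (i, j) lives at i*cols + j.
    double get(int i, int j) { return data[i * cols + j]; }
    void set(int i, int j, double v) { data[i * cols + j] = v; }

    public static void main(String[] args) {
        ArrayView2D a = new ArrayView2D(4, 8);
        // forall-style parallel loop over rows, approximated here with
        // a parallel stream; HJ's forall plays a similar role.
        IntStream.range(0, a.rows).parallel().forEach(i -> {
            for (int j = 0; j < a.cols; j++) {
                a.set(i, j, i * 10 + j);  // arbitrary illustrative fill
            }
        });
        System.out.println(a.get(3, 7)); // element (3, 7) = 3*10 + 7
    }
}
```

In the actual HJ-OpenCL flow, a loop of this shape would be compiled to an OpenCL kernel plus JNI glue, with the flat backing array copied to the device as a single buffer.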
Our approach is named HJ-OpenCL. In contrast to past approaches to generating CUDA or OpenCL from high-level languages, the HJ-OpenCL approach helps the programmer preserve Java exception semantics by generating both exception-safe and unsafe code regions. The execution of one or the other is selected at runtime based on the safe language construct introduced in this paper. We use a set of ten Java benchmarks to evaluate our approach, and observe performance improvements due to both native OpenCL execution and parallelism. On an AMD APU, our results show speedups of up to 36.7× relative to sequential Java when executing on the host 4-core CPU, and of up to 55.0× on the integrated GPU. For a system with an Intel Xeon CPU and a discrete NVIDIA Fermi GPU, the speedups relative to sequential Java are 35.7× for the 12-core CPU and 324.0× for the GPU. Further, we find that different applications perform optimally in JVM execution, in OpenCL CPU execution, and in OpenCL GPU execution. The language features, compiler extensions, and runtime extensions included in this work enable portability, rapid prototyping, and transparent execution of JVM applications across all OpenCL platforms.