Compiling Python to a hybrid execution environment

Authors: Rahul Garg; José Nelson Amaral
Affiliations: McGill University; University of Alberta
Venue: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Year: 2010

Abstract

A new compilation framework enables numerically intensive applications written in Python to run on a hybrid execution environment formed by a CPU and a GPU. The compiler automatically computes the set of memory locations that must be transferred to the GPU and produces the correct mapping between the CPU and GPU address spaces, so the programming model implements a virtual shared address space. The framework combines unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, with jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental evaluation demonstrates that, for some benchmarks, the generated GPU code is 50 times faster than the generated OpenMP code. In most cases, GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations.
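
The idea of computing which memory locations a loop touches and then remapping host addresses to device addresses can be illustrated with a small sketch. This is not the unPython or jit4GPU implementation, which the abstract does not detail. It assumes a single loop whose array index is an affine function of the loop variable, and every name in it (AffineAccess, accessed_range, copy_to_device, to_device_index) is hypothetical. A plain Python list stands in for GPU memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineAccess:
    """Hypothetical descriptor for an access a[stride * i + offset]."""
    stride: int
    offset: int


def accessed_range(access, lo, hi):
    """Return the inclusive (first, last) host indices touched for i in range(lo, hi)."""
    if hi <= lo:
        raise ValueError("empty loop")
    # An affine index is monotonic in i, so its extremes occur at the loop bounds.
    ends = (access.stride * lo + access.offset,
            access.stride * (hi - 1) + access.offset)
    return min(ends), max(ends)


def copy_to_device(host, first, last):
    """Stand-in for a host-to-GPU transfer: copy only the bounding slice."""
    return list(host[first:last + 1])


def to_device_index(host_index, first):
    """Map a host index into the compact device buffer."""
    return host_index - first


host_array = [float(v) for v in range(100)]
access = AffineAccess(stride=2, offset=10)

# The loop "for i in range(0, 20): use(a[2 * i + 10])" touches 10, 12, ..., 48.
first, last = accessed_range(access, lo=0, hi=20)
device_buffer = copy_to_device(host_array, first, last)

# The "kernel" reads through translated indices and sees the same values.
total_device = sum(device_buffer[to_device_index(2 * i + 10, first)]
                   for i in range(20))
total_host = sum(host_array[2 * i + 10] for i in range(20))
```

Only 39 of the 100 host elements are copied here. The bounding interval is a deliberate over-approximation: it also transfers the odd-indexed elements between the touched ones, trading some bandwidth for a trivial index translation.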