Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Authors:
Gregory Frederick Diamos;Andrew Robert Kerr;Sudhakar Yalamanchili;Nathan Clark
Affiliations:
Georgia Institute of Technology, Atlanta, GA, USA;Georgia Institute of Technology, Atlanta, GA, USA;Georgia Institute of Technology, Atlanta, GA, USA;Georgia Institute of Technology, Atlanta, GA, USA
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Citing 20
Cited 40

A bridging model for parallel computation

Communications of the ACM
Dynamo: a transparent dynamic optimization system

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
The Jikes research virtual machine project: building an open-source research community

IBM Systems Journal
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Valgrind: a framework for heavyweight dynamic binary instrumentation

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
DIGITAL FX!32 running 32-bit ×86 applications on alpha NT

NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
PIN: a binary instrumentation tool for computer architecture research and education

WCAE '04 Proceedings of the 2004 workshop on Computer architecture education: held in conjunction with the 31st International Symposium on Computer Architecture
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Predictive Runtime Code Scheduling for Heterogeneous Architectures

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Proceedings of the 36th annual international symposium on Computer architecture
Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
A characterization and analysis of PTX kernels

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Modeling GPU-CPU workloads and systems

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Hierarchical place trees: a portable abstraction for task parallelism and data movement

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Attaining system performance points: revisiting the end-to-end argument in system design for heterogeneous many-core systems

ACM SIGOPS Operating Systems Review
Caracal: dynamic translation of runtime environments for GPUs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A framework for dynamically instrumenting GPU compute applications within GPU Ocelot

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Parallel pattern detection for architectural improvements

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts

Proceedings of the 8th ACM International Conference on Computing Frontiers
Seamlessly portable applications: Managing the diversity of modern heterogeneous systems

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
A performance analysis framework for identifying potential benefits in GPGPU applications

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
GKLEE: concolic verification and test generation for GPUs

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A compile-time managed multi-level register file hierarchy

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
Characterization and transformation of unstructured control flow in bulk synchronous GPU applications

International Journal of High Performance Computing Applications
Dynamic compilation of data-parallel kernels for vector processors

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Improvements to the structural simulation toolkit

Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Proceedings of the 26th ACM international conference on Supercomputing
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
iGPU: exception support and speculative execution on GPUs

Proceedings of the 39th Annual International Symposium on Computer Architecture
VForce: An environment for portable applications on high performance systems with accelerators

Journal of Parallel and Distributed Computing
Gdev: first-class GPU resource management in the operating system

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Multi2Sim: a simulation framework for CPU-GPU computing

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
ValuePack: value-based scheduling framework for CPU-GPU clusters

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
CUDA-for-clusters: a system for efficient execution of CUDA kernels on multi-core clusters

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Spill code placement for SIMD machines

SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Multi-tenancy on GPGPU-based servers

Proceedings of the 7th international workshop on Virtualization technologies in distributed computing
Microarchitectural mechanisms to exploit value structure in SIMT architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture
A large-scale cross-architecture evaluation of thread-coarsening

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Semi-automatic restructuring of offloadable tasks for many-core accelerators

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Automatic OpenCL work-group size selection for multicore CPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Divergence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling concurrent applications on a cluster of CPU-GPU nodes

Future Generation Computer Systems
Characterizing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator

Parallel Computing
Efficient Instrumentation of GPGPU Applications Using Information Flow Analysis and Symbolic Execution

Proceedings of Workshop on General Purpose Processing Using GPUs
Boosting CUDA Applications with CPU---GPU Hybrid Computing

International Journal of Parallel Programming
A visual approach to investigating shared and global memory behavior of CUDA kernels

EuroVis '13 Proceedings of the 15th Eurographics Conference on Visualization

Quantified Score

Hi-index	0.00

Visualization

Abstract

Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain specific applications. This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.