A bridging model for parallel computation
Communications of the ACM
Dynamo: a transparent dynamic optimization system
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Valgrind: a framework for heavyweight dynamic binary instrumentation
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
DIGITAL FX!32 running 32-bit ×86 applications on alpha NT
NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
PIN: a binary instrumentation tool for computer architecture research and education
WCAE '04 Proceedings of the 2004 workshop on Computer architecture education: held in conjunction with the 31st International Symposium on Computer Architecture
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Predictive Runtime Code Scheduling for Heterogeneous Architectures
HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Proceedings of the 36th annual international symposium on Computer architecture
Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
A characterization and analysis of PTX kernels
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Modeling GPU-CPU workloads and systems
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Hierarchical place trees: a portable abstraction for task parallelism and data movement
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
ACM SIGOPS Operating Systems Review
Caracal: dynamic translation of runtime environments for GPUs
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A framework for dynamically instrumenting GPU compute applications within GPU Ocelot
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Parallel pattern detection for architectural improvements
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
Proceedings of the 8th ACM International Conference on Computing Frontiers
Seamlessly portable applications: Managing the diversity of modern heterogeneous systems
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
A performance analysis framework for identifying potential benefits in GPGPU applications
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
GKLEE: concolic verification and test generation for GPUs
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A compile-time managed multi-level register file hierarchy
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors
ACM Transactions on Computer Systems (TOCS)
International Journal of High Performance Computing Applications
Dynamic compilation of data-parallel kernels for vector processors
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Improvements to the structural simulation toolkit
Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors
Proceedings of the 26th ACM international conference on Supercomputing
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
iGPU: exception support and speculative execution on GPUs
Proceedings of the 39th Annual International Symposium on Computer Architecture
VForce: An environment for portable applications on high performance systems with accelerators
Journal of Parallel and Distributed Computing
Gdev: first-class GPU resource management in the operating system
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Multi2Sim: a simulation framework for CPU-GPU computing
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
ValuePack: value-based scheduling framework for CPU-GPU clusters
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
CUDA-for-clusters: a system for efficient execution of CUDA kernels on multi-core clusters
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Spill code placement for SIMD machines
SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Multi-tenancy on GPGPU-based servers
Proceedings of the 7th international workshop on Virtualization technologies in distributed computing
Microarchitectural mechanisms to exploit value structure in SIMT architectures
Proceedings of the 40th Annual International Symposium on Computer Architecture
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation
Proceedings of the 40th Annual International Symposium on Computer Architecture
A large-scale cross-architecture evaluation of thread-coarsening
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Semi-automatic restructuring of offloadable tasks for many-core accelerators
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Automatic OpenCL work-group size selection for multicore CPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling concurrent applications on a cluster of CPU-GPU nodes
Future Generation Computer Systems
Proceedings of Workshop on General Purpose Processing Using GPUs
Boosting CUDA Applications with CPU---GPU Hybrid Computing
International Journal of Parallel Programming
A visual approach to investigating shared and global memory behavior of CUDA kernels
EuroVis '13 Proceedings of the 15th Eurographics Conference on Visualization
Hi-index | 0.00 |
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (PTX) to many-core processors that leverages the Low Level Virtual Machine (LLVM) code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks [1], the Virginia Rodinia benchmarks [2], the GPU-VSIPL signal and image processing library [3], the Thrust library [4], and several domain specific applications. This paper presents a high level overview of the implementation of the Ocelot dynamic compiler highlighting design decisions and trade-offs, and showcasing their effect on application performance. Several novel code transformations are explored that are applicable only when compiling explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class of applications. This study is expected to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.