Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

Authors:
Jayanth Gummaraju;Laurent Morichetti;Michael Houston;Ben Sander;Benedict R. Gaster;Bixia Zheng
Affiliations:
Advanced Micro Devices, Sunnyvale, CA, USA;Advanced Micro Devices, Sunnyvale, CA, USA;Advanced Micro Devices, Sunnyvale, CA, USA;Advanced Micro Devices, Sunnyvale, CA, USA;Advanced Micro Devices, Sunnyvale, CA, USA;Advanced Micro Devices, Sunnyvale, CA, USA
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Abstract

Modern processors are evolving into hybrid, heterogeneous designs in which both CPU and GPU cores are used for general-purpose computation. Several languages, such as Brook, CUDA, and more recently OpenCL, are being developed to harness the potential of these processors. In these languages, control code typically runs on the CPU, while performance-critical, data-parallel kernel code runs on the GPU. In this paper, we present Twin Peaks, a software platform for heterogeneous computing that also executes code originally targeted at GPUs efficiently on CPUs. This permits a more balanced execution between the CPU and GPU, and makes code portable between these architectures and to CPU-only environments. We propose several techniques in the runtime system to efficiently utilize the caches and functional units present in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multicore CPUs with minimal runtime overhead. These results also show that, for maximum performance, it is beneficial for applications to use both CPUs and GPUs as accelerator targets.
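
The general idea of running GPGPU-style kernel code on a multicore CPU can be illustrated with a small sketch. This is not the Twin Peaks runtime, whose techniques the abstract does not detail. It only assumes one common mapping: each work-group is handed to a CPU worker thread, which runs that group's work-items one after another. All names here (`vec_add_kernel`, `run_work_group`, `launch_on_cpu`) are hypothetical and chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def vec_add_kernel(global_id, a, b, out):
    """Body of one work-item, written in the style of a data-parallel kernel."""
    out[global_id] = a[global_id] + b[global_id]


def run_work_group(kernel, group_id, local_size, global_size, args):
    """Run all work-items of one work-group sequentially on a single CPU thread."""
    start = group_id * local_size
    end = min(start + local_size, global_size)  # last group may be partial
    for global_id in range(start, end):
        kernel(global_id, *args)


def launch_on_cpu(kernel, global_size, local_size, args, num_threads=4):
    """Distribute work-groups across a pool of CPU worker threads (assumed mapping)."""
    num_groups = (global_size + local_size - 1) // local_size
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(run_work_group, kernel, g, local_size, global_size, args)
            for g in range(num_groups)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker


# Usage: element-wise addition over 10 work-items in groups of 4.
n = 10
a = list(range(n))
b = [2 * x for x in a]
out = [0] * n
launch_on_cpu(vec_add_kernel, n, 4, (a, b, out))
```

Running consecutive work-items on the same thread means neighbouring elements are touched together, which is one plausible way such a mapping could make good use of CPU caches. The platform's actual cache and functional-unit techniques are described in the paper itself.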