Modern processors are evolving into hybrid, heterogeneous designs in which both CPU and GPU cores are used for general-purpose computation. Several languages, such as Brook, CUDA, and more recently OpenCL, are being developed to harness the potential of these processors. Programs in these languages typically run control code on the CPU and performance-critical, data-parallel kernel code on the GPU. In this paper, we present Twin Peaks, a software platform for heterogeneous computing that also executes code originally targeted at GPUs efficiently on CPUs. This permits a more balanced execution between the CPU and GPU, and makes code portable between these architectures and to CPU-only environments. We propose several runtime-system techniques that make efficient use of the caches and functional units in CPUs. Using OpenCL as a canonical language for heterogeneous computing, and running several experiments on real hardware, we show that our techniques enable GPGPU-style code to execute efficiently on multicore CPUs with minimal runtime overhead. These results also show that, for maximum performance, applications benefit from using both CPUs and GPUs as accelerator targets.
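To make the idea of running GPGPU-style kernels on a CPU concrete, the following is a minimal sketch in C of one common way to do it: each work-group is executed as a plain loop over its work-items on a single core. It is an illustration only and is not taken from Twin Peaks; the names `kernel_fn`, `run_work_group`, `run_ndrange`, and `vec_add_kernel` are hypothetical, and a real runtime would also distribute work-groups across cores and handle barriers and local memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: one call processes one work-item. */
typedef void (*kernel_fn)(size_t global_id, void *args);

/* Arguments for the example kernel below (assumed layout, for illustration). */
struct vec_add_args {
    const float *a;
    const float *b;
    float *out;
    size_t n;
};

/* Data-parallel kernel body: element-wise addition for one work-item. */
static void vec_add_kernel(size_t gid, void *p)
{
    struct vec_add_args *args = p;
    if (gid < args->n)
        args->out[gid] = args->a[gid] + args->b[gid];
}

/* Run all work-items of one work-group back to back on the calling core.
 * Consecutive work-items touch consecutive elements, so the loop walks
 * memory sequentially, which tends to be cache friendly on a CPU. */
static void run_work_group(kernel_fn kernel, size_t group, size_t local_size,
                           size_t global_size, void *args)
{
    size_t begin = group * local_size;
    size_t end = begin + local_size;
    if (end > global_size)
        end = global_size; /* the last group may be partial */
    for (size_t gid = begin; gid < end; ++gid)
        kernel(gid, args);
}

/* Serial stand-in for an NDRange launch: iterate over work-groups.
 * A real runtime would hand different groups to different CPU threads. */
static void run_ndrange(kernel_fn kernel, size_t global_size,
                        size_t local_size, void *args)
{
    assert(local_size > 0);
    size_t groups = (global_size + local_size - 1) / local_size;
    for (size_t g = 0; g < groups; ++g)
        run_work_group(kernel, g, local_size, global_size, args);
}
```

A caller would fill a `struct vec_add_args` with its input and output arrays and invoke `run_ndrange(vec_add_kernel, n, local_size, &args)`. Choosing a larger `local_size` in this sketch amortizes per-group overhead, which is one reason work-group size matters more on CPUs than the kernel source suggests.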