An OpenCL framework for heterogeneous multicores with local memory

Authors:
Jaejin Lee;Jungwon Kim;Sangmin Seo;Seungkyun Kim;Jungho Park;Honggyu Kim;Thanh Tuan Dao;Yongjin Cho;Sung Jong Seo;Seung Hak Lee;Seung Mo Cho;Hyo Jung Song;Sang-Bum Suh;Jong-Deok Choi
Affiliations:
Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Samsung Electronics, Yongin-si, South Korea;Samsung Electronics, Yongin-si, South Korea;Samsung Electronics, Yongin-si, South Korea;Samsung Electronics, Yongin-si, South Korea;Samsung Electronics, Yongin-si, South Korea;Samsung Electronics, Yongin-si, South Korea
Venue:
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Year:
2010

Abstract

In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Instead, each accelerator core has a small internal local memory. To overcome the limited size of the local memory, our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency. To boost performance, the runtime relies on three source-code transformation techniques performed by our OpenCL C source-to-source translator: work-item coalescing, web-based variable expansion, and preload-poststore buffering. Work-item coalescing serializes multiple SPMD-like tasks that execute concurrently in the presence of barriers and runs them sequentially on a single accelerator core. It requires web-based variable expansion to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses, and together with work-item coalescing it has a synergistic effect on performance. We demonstrate the effectiveness of our OpenCL framework by evaluating its performance on a system that consists of two Cell BE processors. The experimental results show that our approach is promising.
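To make work-item coalescing and variable expansion concrete, the following is a minimal sketch in plain C, not the output of the translator described above. The kernel `rev_add`, the function `rev_add_coalesced`, and the constant `MAX_WG_SIZE` are hypothetical names chosen for illustration. The sketch assumes a one-dimensional work-group with a single barrier: the code on each side of the barrier becomes its own loop over work-item IDs, and the private variable `x`, which is live across the barrier, is expanded into an array with one slot per work-item.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WG_SIZE 64  /* assumed upper bound on work-group size (hypothetical) */

/* Hypothetical OpenCL C kernel, shown for reference only:
 *
 * __kernel void rev_add(__global const int *in, __global int *out,
 *                       __local int *tmp) {
 *     int lid = get_local_id(0);
 *     int n   = get_local_size(0);
 *     int x   = in[lid] * 2;          // private, live across the barrier
 *     tmp[lid] = in[lid];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = x + tmp[n - 1 - lid];
 * }
 */

/* Coalesced form: one core runs all n work-items of a work-group in sequence. */
void rev_add_coalesced(const int *in, int *out, size_t n)
{
    int tmp[MAX_WG_SIZE];  /* stands in for the __local buffer */
    int x[MAX_WG_SIZE];    /* private variable x expanded to one slot per work-item */

    assert(n <= MAX_WG_SIZE);

    /* Region before the barrier, executed for every work-item. */
    for (size_t lid = 0; lid < n; lid++) {
        x[lid] = in[lid] * 2;
        tmp[lid] = in[lid];
    }

    /* The barrier becomes the boundary between the two loops:
     * every work-item has finished the first region before any starts the second. */
    for (size_t lid = 0; lid < n; lid++) {
        out[lid] = x[lid] + tmp[n - 1 - lid];
    }
}
```

Calling `rev_add_coalesced` on the input `{1, 2, 3, 4}` with `n = 4` fills the output with `{6, 7, 8, 9}`, matching what four concurrent work-items would compute. Without expanding `x` into an array, the second loop would see only the value written by the last work-item of the first loop, which is why coalescing depends on variable expansion.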