MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
Languages and Compilers for Parallel Computing
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
An OpenCL framework for heterogeneous multicores with local memory
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
An Evaluation of Vectorizing Compilers
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Dynamic compilation of data-parallel kernels for vector processors
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Improving performance of OpenCL on CPUs
CC'12 Proceedings of the 21st international conference on Compiler Construction
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters
Proceedings of the 26th ACM international conference on Supercomputing
Performance characterization of the NAS Parallel Benchmarks in OpenCL
IISWC '11 Proceedings of the 2011 IEEE International Symposium on Workload Characterization
Hi-index | 0.00 |
The state-of-the-art ARM processors provide multiple cores and SIMD instructions. OpenCL is a promising programming model for utilizing such parallel processing capability because of its SPMD programming model and built-in vector support. Moreover, it provides portability between multicore ARM processors and accelerators in embedded systems. In this paper, we introduce the design and implementation of an efficient OpenCL framework for multicore ARM processors. Computational tasks in a program are implemented as OpenCL kernels and run on all CPU cores in parallel by our OpenCL framework. Vector operations and built-in functions in OpenCL kernels are optimized using the NEON SIMD instruction set. We evaluate our OpenCL framework using 37 benchmark applications. The result shows that our approach is effective and promising.