Heterogeneous architectures (mainstream CPUs with accelerators or co-processors) are expected to become more prevalent in high-performance computing clusters. This paper deals specifically with attaining efficient execution on nodes that combine Intel's multicore Sandy Bridge chips with MIC manycore chips. The architecture and software stack for Intel's heterogeneous computation nodes attempt to make migration from the now-common multicore chips to the manycore chips straightforward. However, these manycore chips favor specific execution characteristics, such as use of the wider vector instructions and minimal inter-thread conflicts. Additionally, manycore chips have lower clock speeds and no unified last-level cache. As a result, and as we demonstrate in this paper, it will commonly be the case that not all parts of an application execute more efficiently on the manycore chip than on the multicore chip. This paper presents a process, based on measurements of execution on Westmere-based multicore chips, that can accurately predict which code segments will execute efficiently on the manycore chips, and it illustrates and evaluates the application of this process to three substantial full programs: HOMME, MOIL, and MILC. The effectiveness of the process is validated by verifying, on a Knights Ferry computation node, the scalability of the specific functions and loops that were recommended for MIC execution.