Xeon Phi, the latest Many Integrated Core (MIC) co-processor from Intel, packs up to 1 TFLOP of double-precision performance in a single chip while providing x86 compatibility and supporting popular programming models such as MPI and OpenMP. One of the easiest ways to take advantage of the MIC is to use compiler directives to offload appropriate compute tasks of an application. However, because the Xeon Phi is an expensive resource, production systems are expected to be heterogeneous, with only a subset of compute nodes equipped with a MIC co-processor. Moreover, not all applications will be able to use the full compute power offered by a Xeon Phi. In such scenarios, existing state-of-the-art frameworks, which require applications to be scheduled on compute nodes that have a MIC co-processor, lead to inefficient utilization of the computing power offered by the MIC. To address this limitation, it is critical to design an efficient framework that lets applications offload compute tasks to remote MICs. In this paper, we take on this challenge and design MIC-RO, a novel framework that enables efficient remote offload on heterogeneous MIC clusters. To the best of our knowledge, this is the first design that enables application scientists to offload computation to remote MICs. Our experimental results show that, using MIC-RO, applications are able to offload computation to remote MICs with no overhead compared to offloading to local MICs. Moreover, MIC-RO outperforms the default Intel compiler-based offload techniques by up to a factor of two for multiple benchmarks and application kernels.
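As an illustration of the directive-based local offload that the abstract takes as its baseline, the following is a minimal sketch using the Intel compiler's `#pragma offload` directive. It is not taken from the paper and does not show MIC-RO itself; the function name `scale_offload` and its parameters are hypothetical, and the sketch assumes a single local coprocessor addressed as `mic:0`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the paper: writes factor * src[i] to dst[i].
 * With the Intel compiler's offload support, the pragma requests that the
 * block run on the first local coprocessor (an assumption of this sketch).
 * Compilers that do not recognize the pragma ignore it, and the loop then
 * runs on the host with the same result. */
void scale_offload(const double *src, double *dst, size_t n, double factor)
{
    assert(src != NULL && dst != NULL);

    /* in(...) copies n input elements to the coprocessor before the block;
     * out(...) copies n result elements back to the host afterwards. */
    #pragma offload target(mic:0) in(src : length(n)) out(dst : length(n))
    {
        for (size_t i = 0; i < n; ++i) {
            dst[i] = factor * src[i];
        }
    }
}
```

Calling `scale_offload(src, dst, n, 2.0)` from host code doubles each element; the data-transfer clauses are what make the offload explicit, and they are the part a remote-offload framework would have to carry across the network.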