Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity arising from architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup gained from the GPU is often offset by longer communication latencies and inefficient task scheduling. A suitable parallel programming model is therefore essential to achieving the best possible performance. In this paper, we explore a new hybrid parallel programming model that incorporates GPU acceleration into the Partitioned Global Address Space (PGAS) programming paradigm. As we demonstrate with a case study combining Unified Parallel C (UPC) and CUDA, this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks, the NAS Parallel Benchmark (NPB) FT and MG kernels, are used to show the effectiveness of the proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance due to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlap of communication and computation.