The effective use of GPUs for accelerating applications depends on a number of factors: effective asynchronous use of heterogeneous resources, reduced memory transfer between CPU and GPU, increased occupancy of GPU kernels, overlap of data transfers with computations, reduced GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers, and most optimization strategies are proposed and tuned for individual applications. In this paper, we present G-Charm, a generic framework with an adaptive runtime system for efficient execution of message-driven parallel applications on hybrid systems. The framework is based on Charm++, a message-driven programming environment and runtime for parallel applications. The techniques in our framework include dynamic scheduling of work on CPU and GPU cores, maximizing reuse of data present in GPU memory, data management in GPU memory, and combining multiple kernels. We present results using our framework on Tesla S1070 and Fermi C2070 systems for three classes of applications: a highly regular and parallel 2D Jacobi solver, a regular dense-matrix Cholesky factorization representing linear algebra computations with dependencies among parallel computations, and highly irregular molecular dynamics simulations. With our generic framework, we obtain 1.5 to 15 times improvement over a previous GPU-based implementation of Charm++. We also obtain about 14% improvement over an implementation of Cholesky factorization with a static work-distribution scheme.