Algorithmic and implementation principles are explored for gainfully exploiting GPU accelerators together with multicore processors on high-end systems with large numbers of compute nodes, and are evaluated in an implementation of a scalable block tridiagonal solver. On each compute node, the accelerator is used in combination with that node's multicore processors to perform the block-level linear algebra operations of the overall distributed solver algorithm. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
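The idea behind optimization (3) can be illustrated with a minimal sketch, which is not the memory manager described above. It assumes a hypothetical device budget counted in matrix elements, keeps the right-hand operand resident, and streams the left-hand operand through in row panels sized to fit that budget. The names `panel_rows`, `matmul_in_panels`, and `budget_elems` are invented for this illustration, and plain Python lists stand in for device buffers and for the MAGMA/CUBLAS kernels.

```python
def panel_rows(m, k, n, budget_elems):
    """Rows of A per panel so that one A panel (rows*k), all of B (k*n),
    and the matching C panel (rows*n) fit in budget_elems elements."""
    free = budget_elems - k * n           # space left once B is resident
    rows = free // (k + n)                # each row costs k (A) + n (C) elements
    if rows < 1:
        raise ValueError("budget too small for B plus one row of A and C")
    return min(rows, m)


def matmul_in_panels(A, B, budget_elems):
    """Compute C = A * B one row panel at a time.

    Returns (C, number_of_panels). Slicing a panel stands in for a
    host-to-device copy; the inner loops stand in for a device GEMM call.
    """
    m, k, n = len(A), len(B), len(B[0])
    step = panel_rows(m, k, n, budget_elems)
    C = []
    panels = 0
    for start in range(0, m, step):
        panel = A[start:start + step]     # hypothetical transfer of one panel
        for row in panel:
            C.append([sum(row[p] * B[p][j] for p in range(k))
                      for j in range(n)])
        panels += 1
    return C, panels
```

With a 3x2 matrix `A`, a 2x2 matrix `B`, and a budget of 8 elements, only one row of `A` fits alongside `B`, so the product is computed in three panels; a larger budget collapses this to a single panel. A real implementation would also overlap transfers with computation, which this sketch does not attempt.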