Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Authors:
Alfred J. Park; Kalyan S. Perumalla
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

Algorithmic and implementation principles are explored for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and are evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is used in combination with the multicore processors of that node to perform the block-level linear algebra operations of the overall distributed solver algorithm. The optimizations incorporated include: (1) an efficient memory mapping and synchronization interface that minimizes data movement, (2) multi-process sharing of the accelerator within a node to balance load with the multicore processors, and (3) an automatic memory management system that uses accelerator memory efficiently when sub-matrices exceed the limits of device memory. Results are reported from our novel implementation, which uses the MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on the processors. Overall, using 940 NVIDIA Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation-communication overlap. Detailed quantitative results are presented to explain all critical runtime components that contribute to hybrid performance.
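To make the phrase "block-level linear algebra operations" concrete, the following is a minimal sketch of a block tridiagonal solve. It is illustrative only and is not the authors' implementation: the abstract does not state the elimination scheme, so a sequential block Thomas algorithm (forward elimination, then back substitution) is assumed here, and all function and parameter names (`solve_block_tridiagonal`, `lower`, `diag`, `upper`, `rhs`) are hypothetical. Blocks are plain nested lists so the example needs no external libraries. In a heterogeneous setting, the dense block multiplies and block solves marked in the comments are the kind of operations that could be handed to an accelerator library.

```python
def matmul(a, b):
    """Dense product a @ b (the kind of block operation a GEMM routine performs)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matsub(a, b):
    """Elementwise difference a - b."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def solve(m, r):
    """Solve m @ x = r by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(r[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        if aug[pivot][col] == 0:
            raise ValueError("singular block")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, len(aug[row])):
                aug[row][j] -= factor * aug[col][j]
    width = len(aug[0]) - n
    x = [[0.0] * width for _ in range(n)]
    for row in reversed(range(n)):
        for j in range(width):
            s = aug[row][n + j] - sum(aug[row][k] * x[k][j]
                                      for k in range(row + 1, n))
            x[row][j] = s / aug[row][row]
    return x


def solve_block_tridiagonal(lower, diag, upper, rhs):
    """Block Thomas algorithm (an assumed scheme, not taken from the paper).

    Block row i reads: lower[i] @ x[i-1] + diag[i] @ x[i] + upper[i] @ x[i+1] = rhs[i].
    lower[0] and upper[-1] are unused and may be None.
    Each rhs[i] and each returned x[i] is an m-by-k nested list.
    """
    n = len(diag)
    c_prime = [None] * n  # eliminated super-diagonal blocks
    d_prime = [None] * n  # eliminated right-hand sides
    for i in range(n):
        if i == 0:
            pivot, r = diag[0], rhs[0]
        else:
            # Block updates: dense multiplies, candidates for accelerator offload.
            pivot = matsub(diag[i], matmul(lower[i], c_prime[i - 1]))
            r = matsub(rhs[i], matmul(lower[i], d_prime[i - 1]))
        if i < n - 1:
            c_prime[i] = solve(pivot, upper[i])  # block solve
        d_prime[i] = solve(pivot, r)             # block solve
    x = [None] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = matsub(d_prime[i], matmul(c_prime[i], x[i + 1]))
    return x
```

Calling `solve_block_tridiagonal(lower, diag, upper, rhs)` with lists of equally sized square blocks returns one solution block per block row. The forward sweep is inherently sequential across block rows, which is one reason distributed solvers often prefer other schemes such as cyclic reduction; the sketch is meant only to show where the block-level work arises.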