A scalable framework for heterogeneous GPU-based clusters

Authors:
Fengguang Song;Jack Dongarra
Affiliations:
University of Tennessee, Knoxville, TN, USA;University of Tennessee, Knoxville, TN, USA
Venue:
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Year:
2012

Abstract

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users because of their high energy efficiency and much-improved single-node computational performance. However, little parallel software is available that can efficiently use all CPU cores and all GPUs on a heterogeneous system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system that schedules tasks and transfers data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to resolve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task-generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework attains high performance on distributed-memory clusters without GPUs and on shared-memory multi-GPU systems.
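
The idea of running a serial program to generate tasks and then firing them in dataflow order can be illustrated with a minimal sketch. It is not the authors' runtime system: it uses uniform tiles rather than hybrid-size tasks, the function names (`cholesky_tasks`, `build_graph`, `dataflow_waves`, `owner`) are hypothetical, and the tile-to-node mapping is a plain 2D block-cyclic layout assumed for illustration, not the paper's multi-level partitioning method.

```python
def cholesky_tasks(nt):
    """Yield (name, read_tiles, written_tile) in the order a serial
    tile Cholesky on an nt x nt tile grid would issue its kernels."""
    for k in range(nt):
        yield (f"POTRF({k})", [], (k, k))
        for i in range(k + 1, nt):
            yield (f"TRSM({i},{k})", [(k, k)], (i, k))
        for i in range(k + 1, nt):
            yield (f"SYRK({i},{k})", [(i, k)], (i, i))
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j))


def build_graph(nt):
    """Replay the serial task stream and record, for every task, the
    tasks that last wrote the tiles it touches (its dataflow parents).
    Tracking only the last writer is enough for this algorithm, because
    a tile is read by other tasks only after its final write."""
    last_writer = {}
    deps = {}
    order = []
    for name, reads, write in cholesky_tasks(nt):
        touched = list(reads) + [write]  # the output tile is read-modify-write
        deps[name] = {last_writer[t] for t in touched if t in last_writer}
        last_writer[write] = name
        order.append(name)
    return order, deps


def dataflow_waves(order, deps):
    """Group tasks into waves: a task fires once all its parents are done."""
    done = set()
    waves = []
    while len(done) < len(order):
        ready = [t for t in order if t not in done and deps[t] <= done]
        waves.append(ready)
        done.update(ready)
    return waves


def owner(tile, p, q):
    """ASSUMPTION: 2D block-cyclic mapping of tile (i, j) onto a p x q
    grid of nodes, used here only as a stand-in distribution."""
    i, j = tile
    return (i % p) * q + (j % q)


if __name__ == "__main__":
    order, deps = build_graph(3)
    for step, wave in enumerate(dataflow_waves(order, deps)):
        print(step, wave)
```

Tasks in the same wave have no dependencies on each other, so a runtime could run them concurrently on different CPU cores, GPUs, or nodes. A dynamic scheduler would fire each task as soon as its own parents finish rather than waiting for a whole wave, which is what leaves room to overlap communication with computation.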