Dynamic Load Balancing of Matrix-Vector Multiplications on Roadrunner Compute Nodes

Authors:
José Carlos Sancho; Darren J. Kerbyson
Affiliations:
Los Alamos National Laboratory, Performance and Architecture Laboratory (PAL), USA 87545 (both authors)
Venue:
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Year:
2009

Abstract

Hybrid architectures that combine general-purpose processors with accelerators are being adopted in several large-scale systems, such as the petaflop Roadrunner supercomputer at Los Alamos. In Roadrunner, dual-core Opteron host processors are tightly coupled with PowerXCell 8i accelerator processors within each compute node. Such architectures typically run in an accelerated mode of operation, in which performance hotspots in the computation are off-loaded to the accelerators. In this paper we explore the suitability of a variant of this mode in which the performance hotspots are instead shared between the host and the accelerators. To achieve this, we have designed a new load-balancing algorithm, optimized for the Roadrunner compute nodes, that dynamically distributes computation and its associated data between the host and the accelerators at runtime. Results for sparse and dense matrix-vector multiplications show that load balancing can improve performance by up to 24% over using the accelerators alone.
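
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea of sharing a matrix-vector multiplication between two devices by rows and adjusting the split from measured timings. It is not the authors' method. The function names (`matvec_rows`, `split_matvec`, `rebalance`), the row-wise partitioning, and the proportional-throughput update rule are all assumptions made for illustration, and both "devices" are simulated by the same Python function.

```python
def matvec_rows(A, x, start, stop):
    """Multiply rows [start, stop) of a dense matrix A (list of lists) by vector x."""
    return [sum(a * b for a, b in zip(A[i], x)) for i in range(start, stop)]


def split_matvec(A, x, host_fraction):
    """Compute y = A x, giving the first rows to the host and the rest to the accelerator.

    Hypothetical helper: on real hardware the two calls below would run
    concurrently on different processors.
    """
    n = len(A)
    k = max(0, min(n, round(host_fraction * n)))  # number of rows kept on the host
    y_host = matvec_rows(A, x, 0, k)    # stand-in for the host's share
    y_accel = matvec_rows(A, x, k, n)   # stand-in for the accelerator's share
    return y_host + y_accel


def rebalance(host_fraction, t_host, t_accel):
    """Return a host fraction that would equalize finish times.

    Assumed update rule: estimate each side's throughput from the last
    iteration (share of rows divided by elapsed time) and assign work in
    proportion to throughput.
    """
    if t_host <= 0 or t_accel <= 0 or not 0 < host_fraction < 1:
        return host_fraction  # not enough information to adjust
    rate_host = host_fraction / t_host
    rate_accel = (1.0 - host_fraction) / t_accel
    return rate_host / (rate_host + rate_accel)


# Example: with an even split the host took twice as long as the accelerator,
# so its share of the rows drops from one half to about one third.
new_fraction = rebalance(0.5, t_host=2.0, t_accel=1.0)
```

In an iterative solver, a loop could call `split_matvec` each iteration, time the two shares, and feed the timings to `rebalance` before the next iteration. Data-transfer cost between host and accelerator, which the abstract mentions as part of what is distributed, is ignored in this sketch.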