The reverse-acceleration model for programming petascale hybrid systems

Authors:
S. Pakin; M. Lang; D. J. Kerbyson
Affiliations:
Los Alamos National Laboratory, Los Alamos, New Mexico (all authors)
Venue:
IBM Journal of Research and Development
Year:
2009

Abstract

Current technology trends favor hybrid architectures, typically with each node in a cluster containing both general-purpose and specialized accelerator processors. The typical model for programming such systems is host-centric: the general-purpose processor orchestrates the computation, offloading performance-critical work to the accelerator, and data are communicated only among general-purpose processors. In this paper, we propose a radically different hybrid-programming approach, which we call the reverse-acceleration model. In this model, the accelerators orchestrate the computation, offloading work that cannot be accelerated to the general-purpose processors. Data are communicated among accelerators, not among general-purpose processors. Our thesis is that the reverse-acceleration model simplifies porting codes to hybrid systems and facilitates performance optimization. We present a case study of a legacy neutron-transport code that we modified to use reverse acceleration and ran across the full 122,400 cores (general-purpose plus accelerator) of the Los Alamos National Laboratory Roadrunner supercomputer. Results indicate a substantial performance improvement over the unaccelerated version of the code.
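
The control-flow inversion described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `host_service`, `accelerator_main`, and `run_reverse_acceleration` are hypothetical, Python threads and queues merely stand in for the accelerator and the general-purpose processor, and the "log" request is an assumed example of work that cannot be accelerated.

```python
import queue
import threading


def host_service(requests, replies):
    """Hypothetical host-side loop: it only serves requests from the accelerator."""
    while True:
        op, payload = requests.get()
        if op == "stop":
            break
        if op == "log":
            # Stand-in for a service the accelerator cannot perform itself
            # (for example, file I/O or an operating-system call).
            replies.put(f"logged:{payload}")


def accelerator_main(data, requests, replies):
    """Hypothetical accelerator-side driver: it owns the program's control flow."""
    total = 0
    log = []
    for x in data:
        total += x * x                 # stand-in for the accelerated kernel
        requests.put(("log", total))   # offload non-accelerable work to the host
        log.append(replies.get())      # wait for the host's reply
    requests.put(("stop", None))       # tell the host service to shut down
    return total, log


def run_reverse_acceleration(data):
    """Start the host service, then let the accelerator code drive the run."""
    requests, replies = queue.Queue(), queue.Queue()
    host = threading.Thread(target=host_service, args=(requests, replies))
    host.start()
    result = accelerator_main(data, requests, replies)
    host.join()
    return result
```

Calling `run_reverse_acceleration([1, 2, 3])` runs the main loop in the accelerator role, with the host thread acting only as a helper. In a host-centric version the roles would be swapped: the host would own the loop and hand each kernel invocation to the accelerator.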