Petascale computing with accelerators

Authors:
Michael Kistler;John Gunnels;Daniel Brokenshire;Brad Benton
Affiliations:
IBM Corporation, Austin, TX, USA;IBM Corporation, Yorktown, NY, USA;IBM Corporation, Austin, TX, USA;IBM Corporation, Austin, TX, USA
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Abstract

A trend is developing in high-performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the Roadrunner cluster built by IBM for Los Alamos National Laboratory (LANL). This system combines traditional x86-64 host processors with IBM PowerXCell™ 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and it made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present measured results for single-node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to three things: efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
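
As a rough illustration of the last point, distributing work between host and accelerator, the sketch below splits the columns of a matrix update in proportion to each side's sustained throughput so that both finish at about the same time. It is a minimal sketch under stated assumptions, not the scheme used in the paper: the function name `split_columns` and the rate parameters are hypothetical, and the rates in the usage line are placeholders rather than measurements from Roadrunner.

```python
from typing import Tuple


def split_columns(n_cols: int, host_rate: float, accel_rate: float) -> Tuple[int, int]:
    """Split n_cols columns between host and accelerator.

    Assumption: each side's run time is (columns assigned) / (its rate), so
    assigning columns in proportion to the rates makes both sides finish at
    roughly the same time. Rates may be in any common unit, e.g. GFLOPS.
    """
    if n_cols < 0 or host_rate < 0 or accel_rate < 0:
        raise ValueError("column count and rates must be non-negative")
    total_rate = host_rate + accel_rate
    if total_rate == 0:
        raise ValueError("at least one rate must be positive")

    # The accelerator's share is proportional to its fraction of total throughput.
    accel_cols = round(n_cols * accel_rate / total_rate)
    host_cols = n_cols - accel_cols  # the remainder stays on the host
    return host_cols, accel_cols


# Hypothetical rates: the accelerator is assumed to be 9x faster than the host.
host_cols, accel_cols = split_columns(1000, host_rate=1.0, accel_rate=9.0)
print(host_cols, accel_cols)
```

A static split like this ignores the cost of moving data between host and accelerator memory. A real implementation would also have to account for transfer time and overlap transfers with computation, which is the scheduling concern the abstract raises.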