Entering the petaflop era: the architecture and performance of Roadrunner

Authors:
Kevin J. Barker;Kei Davis;Adolfy Hoisie;Darren J. Kerbyson;Mike Lang;Scott Pakin;Jose C. Sancho
Affiliations:
Los Alamos National Laboratory, Los Alamos;Los Alamos National Laboratory, Los Alamos;Los Alamos National Laboratory, Los Alamos;Los Alamos National Laboratory, Los Alamos;Los Alamos National Laboratory, Los Alamos;Los Alamos National Laboratory, Los Alamos;Los Alamos National Laboratory, Los Alamos
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008

Abstract

Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture (the Cell BE) and on multi-core processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.
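
The system figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the processor, core, and node counts and the 1.38 Pflop/s figure come from the abstract, while the per-chip peak rates (108.8 Gflop/s per PowerXCell 8i and 3.6 Gflop/s per Opteron core, both double precision) are nominal values assumed here and are not stated in the abstract. The function names are hypothetical.

```python
# Figures stated in the abstract.
CELL_PROCESSORS = 12_240
OPTERON_CORES = 12_240
COMPUTE_NODES = 3_060
QUOTED_PEAK_PFLOPS = 1.38

# ASSUMPTIONS (not from the abstract): nominal double-precision peak rates.
CELL_PEAK_GFLOPS = 108.8         # assumed peak per PowerXCell 8i processor
OPTERON_CORE_PEAK_GFLOPS = 3.6   # assumed peak per Opteron core


def per_node(total: int, nodes: int = COMPUTE_NODES) -> int:
    """Return how many units each node holds, requiring an even split."""
    quotient, remainder = divmod(total, nodes)
    if remainder:
        raise ValueError(f"{total} units do not divide evenly over {nodes} nodes")
    return quotient


def estimated_peak_pflops() -> float:
    """Sum the assumed per-unit peaks over the whole machine, in Pflop/s."""
    total_gflops = (CELL_PROCESSORS * CELL_PEAK_GFLOPS
                    + OPTERON_CORES * OPTERON_CORE_PEAK_GFLOPS)
    return total_gflops / 1e6  # 1 Pflop/s = 1e6 Gflop/s


if __name__ == "__main__":
    print("PowerXCell 8i processors per node:", per_node(CELL_PROCESSORS))
    print("Opteron cores per node:", per_node(OPTERON_CORES))
    print(f"Estimated peak: {estimated_peak_pflops():.3f} Pflop/s "
          f"(abstract quotes {QUOTED_PEAK_PFLOPS})")
```

Under these assumed rates, nearly all of the peak comes from the PowerXCell 8i processors, which is consistent with the abstract's emphasis on adapting Sweep3D to the hybrid architecture.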