The PERCS High-Performance Interconnect

Authors:
Baba Arimilli;Ravi Arimilli;Vicente Chung;Scott Clark;Wolfgang Denzel;Ben Drerup;Torsten Hoefler;Jody Joyner;Jerry Lewis;Jian Li;Nan Ni;Ram Rajamony
Venue:
HOTI '10 Proceedings of the 2010 18th IEEE Symposium on High Performance Interconnects
Year:
2010

Citing 0
Cited 25

Generic topology mapping strategies for large-scale parallel architectures

Proceedings of the international conference on Supercomputing
Cache injection for parallel applications

Proceedings of the 20th international symposium on High performance distributed computing
PERCS: the IBM POWER7-IH high-performance computing system

IBM Journal of Research and Development
Performance modeling for systematic performance tuning

State of the Practice Reports
An early performance analysis of POWER7-IH HPC systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Avoiding hot-spots on two-level direct networks

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
ACM SRC poster: optimizing all-to-all algorithm for PERCS network using simulation

Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
Visualization of simulation results for the PERCS Hub chip performance verification

Proceedings of the 4th International ICST Conference on Simulation Tools and Techniques
Runtime detection and optimization of collective communication patterns

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Looking under the hood of the IBM Blue Gene/Q network

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Cray cascade: a scalable HPC system based on a Dragonfly network

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic communication coalescing for irregular computations in UPC language

CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research
The impact of global communication latency at extreme scales on Krylov methods

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Collectives on two-tier direct networks

EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
Improving communication in PGAS environments: static and dynamic coalescing in UPC

Proceedings of the 27th international ACM conference on International conference on supercomputing
The Power 775 architecture at scale

Proceedings of the 27th international ACM conference on International conference on supercomputing
Evaluating on-die interconnects for a 4 TB/s router

Proceedings of the 27th international ACM conference on International conference on supercomputing
Distributed full switch as an ideal system area network for multiprocessor computers

Automation and Remote Control
Global misrouting policies in two-level hierarchical networks

Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip
High and stable performance under adverse traffic patterns of tori-connected torus network

Computers and Electrical Engineering
Enabling highly-scalable remote memory access programming with MPI-3 one sided

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scalable high-radix router microarchitecture using a network switch organization

ACM Transactions on Architecture and Code Optimization (TACO)
X10 and APGAS at Petascale

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Performance implications of remote-only load balancing under adversarial traffic in Dragonflies

Proceedings of the 8th International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip
Understanding system design for big data workloads

IBM Journal of Research and Development

Abstract

The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity, high-performance computing system. A major innovation in the PERCS design is the network, which is built from Hub chips that are integrated into the compute nodes. Each Hub chip is about 580 mm² in size, uses 45 nm IBM CMOS 12S0 SOI technology with 13 levels of metal, has over 3700 signal I/Os, and is packaged in a module that also contains LGA-attached optical electronic devices. The Hub module implements five types of high-bandwidth interconnects with multiple links that are fully connected through a high-performance internal crossbar switch. These links provide over 9 Tbits/second of raw bandwidth and are used to construct a two-level direct-connect topology spanning up to tens of thousands of POWER7 chips with high bisection bandwidth and low latency. The Blue Waters system, which is being constructed at NCSA, is an exemplar large-scale PERCS installation. Blue Waters is expected to deliver sustained petascale performance over a wide range of applications. The Hub chip supports several high-performance computing protocols (e.g., MPI, RDMA, IP) and also provides a non-coherent system-wide global address space. Collective communication operations such as barriers, reductions, and multicast are supported directly in hardware. Multiple routing modes, including deterministic as well as hardware-directed random routing, are also supported. Finally, the Hub module is capable of operating in the presence of many types of hardware faults and degrades performance gracefully in the presence of lane failures.
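
The contrast between deterministic and random routing on a two-level direct-connect topology can be illustrated with a small sketch. It is a toy model under stated assumptions, not the PERCS routing logic: nodes inside a group are assumed to be fully connected, every pair of groups is assumed to share exactly one global link, and the `gateway` rule that picks which node owns that link is hypothetical. The random mode is modeled as routing through a randomly chosen intermediate group (a Valiant-style scheme); the abstract does not say that the hardware works this way.

```python
import random

def gateway(src_group, dst_group, nodes_per_group):
    """Hypothetical rule: index of the node in src_group that owns
    the single global link toward dst_group (an assumption)."""
    return dst_group % nodes_per_group

def direct_route(src, dst, nodes_per_group):
    """Deterministic shortest path: at most local, global, local."""
    (sg, si), (dg, di) = src, dst
    path = [src]
    if sg == dg:
        if si != di:
            path.append(dst)  # one local hop inside a fully connected group
        return path
    out_gw = (sg, gateway(sg, dg, nodes_per_group))
    in_gw = (dg, gateway(dg, sg, nodes_per_group))
    for hop in (out_gw, in_gw, dst):
        if hop != path[-1]:  # skip hops that would stay on the same node
            path.append(hop)
    return path

def random_indirect_route(src, dst, num_groups, nodes_per_group, rng=random):
    """Valiant-style route through a randomly chosen intermediate group."""
    sg, dg = src[0], dst[0]
    candidates = [g for g in range(num_groups) if g not in (sg, dg)]
    if sg == dg or not candidates:
        return direct_route(src, dst, nodes_per_group)
    mid = rng.choice(candidates)
    # Node in the intermediate group where the link from sg arrives.
    mid_entry = (mid, gateway(mid, sg, nodes_per_group))
    first = direct_route(src, mid_entry, nodes_per_group)
    second = direct_route(mid_entry, dst, nodes_per_group)
    return first + second[1:]  # drop the duplicated intermediate node

if __name__ == "__main__":
    rng = random.Random(0)
    src, dst = (0, 1), (3, 2)
    print("direct:  ", direct_route(src, dst, nodes_per_group=4))
    print("indirect:", random_indirect_route(src, dst, 8, 4, rng))
```

In this model a direct route takes at most three hops and an indirect route at most five. The extra hops are the price of spreading traffic across more global links, which is the usual motivation for offering a randomized mode alongside a deterministic one.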