The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity, high-performance computing system. A major innovation in the PERCS design is the network, which is built from Hub chips integrated into the compute nodes. Each Hub chip is about 580 mm² in size, uses 45 nm IBM CMOS 12S0 SOI technology with 13 levels of metal, has over 3700 signal I/Os, and is packaged in a module that also contains LGA-attached optical electronic devices. The Hub module implements five types of high-bandwidth interconnects with multiple links that are fully connected through a high-performance internal crossbar switch. These links provide over 9 Tbits/second of raw bandwidth and are used to construct a two-level direct-connect topology spanning up to tens of thousands of POWER7 chips with high bisection bandwidth and low latency. The Blue Waters system, which is being constructed at NCSA, is an exemplar large-scale PERCS installation. Blue Waters is expected to deliver sustained petascale performance over a wide range of applications. The Hub chip supports several high-performance computing protocols (e.g., MPI, RDMA, IP) and also provides a non-coherent system-wide global address space. Collective communication operations such as barriers, reductions, and multicast are supported directly in hardware. Multiple routing modes, including deterministic routing and hardware-directed random routing, are also supported. Finally, the Hub module can operate in the presence of many types of hardware faults, and its performance degrades gracefully in the presence of lane failures.
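To make the routing modes mentioned above more concrete, the following is a minimal Python sketch of deterministic versus random routing on a generic two-level direct network. It is not the PERCS implementation. The group count, the hubs per group, the `gateway` mapping of global links to hubs, and the choice to model random routing as a detour through one randomly chosen intermediate group are all assumptions made for illustration.

```python
import random

# Hypothetical sizes, chosen only for illustration (not PERCS parameters).
HUBS_PER_GROUP = 4
NUM_GROUPS = 6

def gateway(src_group, dst_group):
    """Hub in src_group assumed to own the global link to dst_group."""
    return (src_group, dst_group % HUBS_PER_GROUP)

def direct_route(src, dst):
    """Deterministic route: at most local hop, global hop, local hop."""
    path = [src]
    if src[0] != dst[0]:
        exit_hub = gateway(src[0], dst[0])    # leaves the source group
        entry_hub = gateway(dst[0], src[0])   # enters the destination group
        for hop in (exit_hub, entry_hub):
            if hop != path[-1]:
                path.append(hop)
    if dst != path[-1]:
        path.append(dst)
    return path

def random_route(src, dst, rng):
    """Random route: detour through one randomly chosen intermediate group."""
    if src[0] == dst[0]:
        return direct_route(src, dst)
    candidates = [g for g in range(NUM_GROUPS) if g not in (src[0], dst[0])]
    mid_group = rng.choice(candidates)
    mid = gateway(mid_group, src[0])          # entry hub of the detour group
    first = direct_route(src, mid)
    second = direct_route(mid, dst)
    return first + second[1:]                 # drop the duplicated mid hub

if __name__ == "__main__":
    rng = random.Random(0)
    print(direct_route((0, 1), (3, 2)))
    print(random_route((0, 1), (3, 2), rng))
```

In this sketch a direct route uses at most three hops, while a random route uses at most five. That is the usual trade-off between latency and spreading load across global links under adversarial traffic.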