Improved point-to-point and collective communication performance with output-queued high-radix routers

Authors:
Sameer Kumar; Craig Stunkel; Laxmikant V. Kalé
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY; IBM T.J. Watson Research Center, Yorktown Heights, NY; Department of Computer Science, University of Illinois at Urbana-Champaign
Venue:
HiPC '05 Proceedings of the 12th International Conference on High Performance Computing
Year:
2005

References (16):

High-performance multi-queue buffers for VLSI communications switches

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and Their Impact

IEEE Transactions on Parallel and Distributed Systems
HIPIQS: A High-Performance Switch Architecture Using Input Queuing

IEEE Transactions on Parallel and Distributed Systems
Tiny Tera: A Packet Switch Core

IEEE Micro
Performance Evaluation of the Quadrics Interconnection Network

Cluster Computing
Adaptive Source Routing in Multistage Interconnection Networks

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
A Reliable Hardware Barrier Synchronization Scheme

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Congestion-Free Routing on the CM-5 Data Router

PCRCW '94 Proceedings of the First International Workshop on Parallel Computer Routing and Communication
NAMD: biomolecular simulation on thousands of processors

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A Framework for Collective Personalized Communication

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
K-ary N-trees: High Performance Networks for Massively Parallel Architectures

K-ary N-trees: High Performance Networks for Massively Parallel Architectures
Scaling All-to-All Multicast on Fat-tree Networks

ICPADS '04 Proceedings of the Parallel and Distributed Systems, Tenth International Conference
POSE: Getting Over Grainsize in Parallel Discrete Event Simulation

ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Scalable NIC-based Reduction on Large-scale Clusters

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Multicast scheduling for input-queued switches

IEEE Journal on Selected Areas in Communications

Abstract

We present an output-queued switch architecture with cross-point buffering that improves performance for both point-to-point communication and hardware-accelerated collective communication. In the past, output-queuing architectures have been less popular because they require more internal speedup and buffering. However, with current technology it is possible to build output-queued switches with a relatively large number of ports. We demonstrate that our output-queued architecture performs well for point-to-point messages, especially in a fat-tree topology. We also show that output-queued architectures facilitate efficient implementations of multicasts and reductions. We present the performance of multicasts and reductions on individual switches and on a network of switches interconnected in a fat-tree topology. We also present simulation results based on synthetic workloads that emulate a molecular dynamics application.
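As a rough illustration of why cross-point buffering suits multicast, the toy model below keeps one FIFO per (input, output) pair, so a multicast becomes a copy into several cross-points of the same input row, and each output drains its own column independently. This is only a sketch under stated assumptions, not the architecture evaluated in the paper: the class `CrosspointSwitch`, its methods, the unbounded buffers, and the round-robin arbitration are all hypothetical choices made for the example.

```python
from collections import deque


class CrosspointSwitch:
    """Toy N x N switch with one FIFO per (input, output) cross-point.

    Hypothetical model for illustration; buffers are unbounded and
    arbitration is plain round-robin, which the paper does not specify.
    """

    def __init__(self, ports):
        self.ports = ports
        # buffers[i][o] holds packets from input i waiting for output o.
        self.buffers = [[deque() for _ in range(ports)] for _ in range(ports)]
        # Per-output round-robin pointer: the input to be considered first.
        self.next_input = [0] * ports

    def inject(self, src, dests, payload):
        """Unicast or multicast: place one copy in each destination's cross-point."""
        for dst in dests:
            self.buffers[src][dst].append(payload)

    def step(self):
        """One cycle: every output independently forwards at most one packet."""
        delivered = {}
        for out in range(self.ports):
            for k in range(self.ports):
                src = (self.next_input[out] + k) % self.ports
                if self.buffers[src][out]:
                    delivered[out] = (src, self.buffers[src][out].popleft())
                    self.next_input[out] = (src + 1) % self.ports
                    break
        return delivered


sw = CrosspointSwitch(4)
sw.inject(0, [1, 2, 3], "m")  # multicast from input 0 to outputs 1, 2, 3
sw.inject(1, [3], "u")        # unicast from input 1, contending for output 3
first = sw.step()
second = sw.step()
```

In this model the multicast reaches all three outputs in the first cycle, while the contending unicast waits in its own cross-point and is served by output 3 in the second cycle without blocking any other output.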