Design and performance of speculative flow control for high-radix datacenter interconnect switches

  • Authors:
  • Cyriel Minkenberg; Mitchell Gusat

  • Affiliations:
  • IBM Research GmbH, Zurich Research Laboratory, Säumerstrasse 4, 8803 Rüschlikon, Switzerland

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2009

Abstract

High-radix switches are desirable building blocks for large computer interconnection networks, because they convert chip I/O bandwidth into low latency and low cost more effectively than low-radix switches [J. Kim, W.J. Dally, B. Towles, A.K. Gupta, Microarchitecture of a high-radix router, in: Proc. ISCA 2005, Madison, WI, 2005]. Unfortunately, most existing switch architectures do not scale well to a large number of ports; the buffer complexity of the buffered crossbar architecture, for example, grows quadratically with the port count. Compounded with support for long round-trip times and many virtual channels, the overall buffer requirements limit the feasibility of such switches to modest port counts. Compromising on buffer sizing causes a drastic increase in latency and a reduction in throughput as long as traditional credit flow control is employed at the link level. We propose a novel link-level flow control protocol that enables high-performance switches based on the increasingly popular buffered crossbar architecture to scale to higher port counts without sacrificing performance. By combining credited and speculative transmission, this scheme achieves reliable delivery, low latency, and high throughput, even with crosspoint buffers that are significantly smaller than the round-trip time. The proposed scheme substantially reduces message latency and improves the throughput of partially buffered crossbar switches under synthetic uniform and non-uniform bursty traffic. Moreover, simulations replaying traces of several typical MPI applications demonstrate communication speedups of 2 to 10 times.
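
The abstract only outlines the idea of combining credited and speculative transmission. The sketch below is a minimal Python model of that intuition, not the paper's actual protocol: it assumes a zero-delay data path (only the credit return experiences the round-trip time), a single crosspoint, and an eviction rule in which a credited arrival may displace a tentatively buffered speculative packet, which is then retransmitted. All identifiers (Crosspoint, arrive_speculative, CAPACITY, RTT, and so on) are hypothetical.

```python
from collections import deque

CAPACITY = 2   # crosspoint buffer slots (assumed much smaller than the RTT)
RTT = 6        # cycles for a credit to return to the sender (assumed)

class Crosspoint:
    """Crosspoint buffer holding (packet, tentative) pairs."""
    def __init__(self):
        self.slots = deque()

    def arrive_credited(self, pkt):
        """Credited packets are never lost: if the buffer is full, evict a
        tentative (speculative) packet for later retransmission."""
        evicted = None
        if len(self.slots) == CAPACITY:
            for i, (p, tentative) in enumerate(self.slots):
                if tentative:
                    evicted = p
                    del self.slots[i]
                    break
        self.slots.append((pkt, False))
        return evicted

    def arrive_speculative(self, pkt):
        """Accept a speculative packet only if a slot happens to be free."""
        if len(self.slots) < CAPACITY:
            self.slots.append((pkt, True))
            return True
        return False

    def drain(self):
        """Forward the oldest packet onward, freeing its slot."""
        return self.slots.popleft() if self.slots else None

# Sender state: packets to send, credits in hand, credits in flight.
queue = deque(range(10))
credits = CAPACITY
credit_pipe = deque()   # cycle numbers at which in-flight credits arrive
xp = Crosspoint()

for cycle in range(20):
    # Credits that completed the round trip become usable again.
    while credit_pipe and credit_pipe[0] <= cycle:
        credit_pipe.popleft()
        credits += 1

    # Prefer credited transmission; speculate only when out of credits.
    if queue:
        pkt = queue[0]
        if credits > 0:
            credits -= 1
            queue.popleft()
            evicted = xp.arrive_credited(pkt)
            if evicted is not None:
                queue.appendleft(evicted)        # retransmit the victim
            print(f"cycle {cycle:2d}: credited    {pkt}")
        elif xp.arrive_speculative(pkt):
            queue.popleft()
            print(f"cycle {cycle:2d}: speculative {pkt} accepted")
        else:
            print(f"cycle {cycle:2d}: speculative {pkt} dropped, retry")

    # Receiver forwards one packet per cycle. Only a drained *credited*
    # slot returns a credit; a speculative packet consumed no credit.
    drained = xp.drain()
    if drained is not None and not drained[1]:
        credit_pipe.append(cycle + RTT)
```

Under pure credit flow control, this link would idle whenever both credits are in flight, capping throughput at roughly CAPACITY/RTT; in the sketch, speculative packets fill those idle cycles, which is the intuition behind the latency and throughput gains the abstract reports.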