A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture

  • Authors:
  • Ying Ping Zhang;Taikyeong Jeong;Fei Chen;Haiping Wu;Ronny Nitzsche;Guang R. Gao

  • Affiliations:
  • University of Delaware, Department of Electrical and Computer Engineering, Newark, Delaware (all authors)

  • Venue:
  • IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
  • Year:
  • 2006

Abstract

High-performance processor architectures are moving toward integrating a large number of processing cores on a single chip. The IBM Cyclops-64 (C64) is a petaflop supercomputer built on multi-core system-on-a-chip technology. Each C64 chip employs a multistage pipelined crossbar switch as its on-chip interconnection network to provide high-bandwidth, low-latency communication between the 160 thread processing cores, the on-chip SRAM memory banks, and other components. In this paper, we present a simulation-based study of the architecture and performance of the C64 on-chip interconnection network. Our experimental results yield the following observations on network behavior: (1) Dedicated channels can be created between any output port and input port of the C64 crossbar with latency as low as 7 cycles; the C64 crossbar has the potential to reach the full hardware bandwidth and exhibits non-blocking behavior. (2) The C64 crossbar is a stable network. (3) The network logic design appears to provide a reasonable opportunity for sharing the channel bandwidth between traffic in either direction. (4) A simple circular neighbor arbitration scheme can achieve performance competitive with the more complex segmented LRU (Least Recently Used) matrix arbitration scheme without sacrificing fairness. (5) Application-driven benchmarks produce results comparable to those of synthetic workloads.