A distributed, simultaneously multi-threaded (SMT) processor with clustered scheduling windows for scalable DSP performance

  • Authors:
  • Mladen Berekovic; Tim Niggemeier

  • Affiliations:
  • IMEC, Eindhoven, The Netherlands and TU Delft, Delft, The Netherlands; IBM Deutschland Entwicklung GmbH, Böblingen, Germany

  • Venue:
  • Journal of Signal Processing Systems - Special Issue: Embedded computing systems for DSP
  • Year:
  • 2008

Abstract

A scalable, distributed processor architecture is presented that targets high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of on-chip parallelism. The architecture is based on a superscalar processor model with a modified Tomasulo scheme, extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads (simultaneous multi-threading, SMT). Consistent application of fine-grained clustering reduces the cycle time of wire-sensitive building blocks such as the register file and the scheduling window. This leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files, and memories are distributed across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative comparison of destination tags in a centralized instruction scheduler with explicit operand transfer instructions, thus decentralizing control of the data flow to the greatest possible extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. The performance of the architecture therefore scales with both the number of function units and the number of thread units without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated on a cycle-true simulator with critical signal processing kernels from the MPEG-4 video coding standard.
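
The idea of replacing tag broadcast with explicit operand transfers can be illustrated with a toy model. The Python sketch below is not the paper's protocol, whose details the abstract does not give. It assumes a hypothetical `Slot` class standing for one scheduling-window entry, where each producer carries an explicit list of `(consumer, operand index)` targets and delivers its result only to those targets. No entry ever compares a broadcast destination tag against its own operands. The names `Slot`, `targets`, and `run` are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)  # identity comparison: two slots are never "equal"
class Slot:
    """One scheduling-window entry (hypothetical model, not the paper's)."""
    name: str
    op: Callable[..., int]
    operands: List[Optional[int]]  # None marks an operand not yet delivered
    # Explicit transfer list: (consumer slot, operand index in that consumer).
    targets: List[Tuple["Slot", int]] = field(default_factory=list)

    def ready(self) -> bool:
        return all(value is not None for value in self.operands)


def run(slots):
    """Fire ready slots step by step; return (results, transfer count)."""
    results = {}
    transfers = 0
    pending = list(slots)
    while pending:
        # Readiness is sampled once per step, so a value delivered in this
        # step can only wake its consumer in the next step.
        fired = [s for s in pending if s.ready()]
        if not fired:
            raise RuntimeError("no slot is ready: missing operand transfer")
        for s in fired:
            value = s.op(*s.operands)
            results[s.name] = value
            # Point-to-point delivery to the listed consumers only.
            for consumer, index in s.targets:
                consumer.operands[index] = value
                transfers += 1
        pending = [s for s in pending if s not in fired]
    return results, transfers


# Dataflow graph for (2 + 3) * (7 - 4).
add = Slot("add", lambda x, y: x + y, [2, 3])
sub = Slot("sub", lambda x, y: x - y, [7, 4])
mul = Slot("mul", lambda x, y: x * y, [None, None])
add.targets.append((mul, 0))
sub.targets.append((mul, 1))

results, transfers = run([mul, add, sub])
print(results["mul"], transfers)
```

In this model the work done per result depends on the number of consumers listed for that result, not on the total number of window entries. That is the property the abstract attributes to the decentralized scheme, shown here only in simplified form.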