Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation

  • Authors:
  • Krishna M. Kavi; Roberto Giorgi; Joseph Arul

  • Affiliations:
  • Univ. of Alabama, Huntsville; Univ. di Siena, Siena, Italy; Univ. of Alabama, Huntsville

  • Venue:
  • IEEE Transactions on Computers - Special Issue on the Parallel Architecture and Compilation Techniques Conference
  • Year:
  • 2001


Abstract

In this paper, the Scheduled Dataflow (SDF) architecture, a decoupled memory/execution, multithreaded architecture using nonblocking threads, is presented in detail and evaluated against a superscalar architecture. Recent work on new processor architectures focuses mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend yields better performance, but at the expense of increased hardware complexity and, possibly, higher power consumption resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful, execution paradigm based on dataflow and multithreading. A program is partitioned into nonblocking execution threads, and all memory accesses are decoupled from a thread's execution: data is preloaded into the thread's context (registers), and all results are poststored after the thread completes execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate a thread's memory accesses and execution and to eliminate unnecessary dependencies among instructions. We compared the execution cycles required by programs on SDF with those required on SimpleScalar (a superscalar simulator), considering the essential aspects of both architectures so as to make the comparison fair. The results show that the SDF architecture can outperform the superscalar: SDF performance scales better with the number of functional units and allows good exploitation of Thread Level Parallelism (TLP) and of the available chip area.
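The three-phase thread model described in the abstract (preload, execute, poststore) can be illustrated with a short sketch. The C code below is an analogy, not the authors' instruction set: frame_t, preload, execute, and poststore are hypothetical names standing in for an SDF thread's register frame and its three phases, and the split between a memory pipeline and an execution pipeline is only suggested by comments.

```c
/* A minimal sketch (assumed names, not the SDF ISA) of how one
 * nonblocking SDF thread is conceptually staged. */

#include <stddef.h>
#include <stdio.h>

typedef struct {
    double a, b;    /* "registers": the thread's preloaded context    */
    double result;  /* produced by execute, written back by poststore */
} frame_t;          /* hypothetical per-thread frame                  */

/* Preload phase (memory side): fill the frame from memory before the
 * execution pipeline ever sees the thread. */
static void preload(frame_t *f, const double *mem, size_t i, size_t j) {
    f->a = mem[i];
    f->b = mem[j];
}

/* Execute phase (execution side): register-to-register computation
 * only; by construction it cannot block on memory, so it runs to
 * completion once scheduled. */
static void execute(frame_t *f) {
    f->result = f->a * f->b + f->a;
}

/* Poststore phase (memory side): results leave the frame for memory
 * only after the thread has finished executing. */
static void poststore(const frame_t *f, double *mem, size_t k) {
    mem[k] = f->result;
}

int main(void) {
    double mem[4] = {2.0, 3.0, 0.0, 0.0};
    frame_t f;
    preload(&f, mem, 0, 1);  /* memory  -> frame  */
    execute(&f);             /* frame   -> frame  */
    poststore(&f, mem, 2);   /* frame   -> memory */
    printf("%f\n", mem[2]);  /* prints 8.000000   */
    return 0;
}
```

The point of the staging is that the execute phase never touches memory, so a scheduler can overlap one thread's preload or poststore with another thread's execution, hiding memory latency without the dynamic instruction-scheduling hardware of a superscalar.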