Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation

  • Authors:
  • Krishna M. Kavi; Roberto Giorgi; Joseph Arul

  • Affiliations:
  • Univ. of Alabama, Huntsville; Univ. di Siena, Siena, Italy; Univ. of Alabama, Huntsville

  • Venue:
  • IEEE Transactions on Computers - Special Issue on the Parallel Architecture and Compilation Techniques Conference
  • Year:
  • 2001


Abstract

In this paper, the Scheduled Dataflow (SDF) architecture, a decoupled memory/execution, multithreaded architecture using nonblocking threads, is presented in detail and evaluated against a superscalar architecture. Recent work on new processor architectures focuses mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend yields better performance, but at the expense of increased hardware complexity and, possibly, higher power consumption resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful, execution paradigm based on dataflow and multithreading. A program is partitioned into nonblocking execution threads, and all memory accesses are decoupled from a thread's execution: data is preloaded into the thread's context (registers), and all results are poststored after the thread completes execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate a thread's memory accesses and execution and to eliminate unnecessary dependencies among instructions. We compared the execution cycles required by programs on SDF with those required on SimpleScalar (a superscalar simulator), considering the essential aspects of both architectures so as to make the comparison fair. The results show that the SDF architecture can outperform the superscalar: SDF performance scales better with the number of functional units and allows good exploitation of Thread Level Parallelism (TLP) and of the available chip area.
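The three-phase thread model described in the abstract (preload, execute, poststore) can be illustrated with a short sketch. The C code below is an analogy, not the authors' instruction set: frame_t, preload, execute, and poststore are hypothetical names standing in for an SDF thread's register frame and its three phases, and the split between a memory pipeline and an execution pipeline is only suggested by comments.

```c
/* A minimal sketch (assumed names, not the SDF ISA) of how one
 * nonblocking SDF thread is conceptually staged. */

#include <stddef.h>
#include <stdio.h>

typedef struct {
    double a, b;    /* "registers": the thread's preloaded context    */
    double result;  /* produced by execute, written back by poststore */
} frame_t;          /* hypothetical per-thread frame                  */

/* Preload phase (memory side): fill the frame from memory before the
 * execution pipeline ever sees the thread. */
static void preload(frame_t *f, const double *mem, size_t i, size_t j) {
    f->a = mem[i];
    f->b = mem[j];
}

/* Execute phase (execution side): register-to-register computation
 * only; by construction it cannot block on memory, so it runs to
 * completion once scheduled. */
static void execute(frame_t *f) {
    f->result = f->a * f->b + f->a;
}

/* Poststore phase (memory side): results leave the frame for memory
 * only after the thread has finished executing. */
static void poststore(const frame_t *f, double *mem, size_t k) {
    mem[k] = f->result;
}

int main(void) {
    double mem[4] = {2.0, 3.0, 0.0, 0.0};
    frame_t f;
    preload(&f, mem, 0, 1);  /* memory  -> frame  */
    execute(&f);             /* frame   -> frame  */
    poststore(&f, mem, 2);   /* frame   -> memory */
    printf("%f\n", mem[2]);  /* prints 8.000000   */
    return 0;
}
```

The point of the staging is that the execute phase never touches memory, so a scheduler can overlap one thread's preload or poststore with another thread's execution, hiding memory latency without the dynamic instruction-scheduling hardware of a superscalar.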