Recent evidence indicates that the exploitation of locality in dataflow programs could have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional non-strict fine grain instruction execution and strict coarse grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. In this paper, we analyze the latency incurred through the partitioning of fine grain instructions into coarse grain clusters, quantifying coarse grain input and output latencies using a set of numeric benchmarks. The results offer compelling evidence that the inner loops of a significant number of numeric codes would benefit from coarse grain execution. Based on cluster execution times, more than 60% of the measured benchmarks favor coarse grain execution. In 64% of the cases the input latency to the cluster is the same in coarse and fine grain execution modes. The results suggest that the effects of increased instruction granularity on latency are minimal for a high percentage of the measured codes, and are in large part offset by available intra-thread locality. Furthermore, simulation results indicate that strict or non-strict data structure access does not change the basic cluster characteristics.
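The trade-off described above can be sketched concretely. The toy model below is an illustration only, not the paper's methodology: the graph, unit instruction times, and the chosen cluster are invented. It compares non-strict fine grain execution (each instruction fires as soon as its operands arrive, so latency is the critical-path length) against strict coarse grain execution of one cluster (the thread starts only after all external inputs are available, then runs sequentially).

```python
# Hypothetical sketch of fine grain vs. coarse grain latency on a small
# dataflow graph. All names and timings are invented for illustration.

# Dataflow graph: instruction -> operand-producing predecessors.
GRAPH = {
    "a": [], "b": [],
    "c": ["a", "b"],
    "d": ["c"],
    "e": ["c"],
    "f": ["d", "e"],
}
UNIT = 1  # assume unit execution time per instruction


def fine_grain_latency(graph):
    """Non-strict execution: latency equals the critical-path length."""
    memo = {}

    def finish(n):
        if n not in memo:
            memo[n] = UNIT + max((finish(p) for p in graph[n]), default=0)
        return memo[n]

    return max(finish(n) for n in graph)


def coarse_grain_latency(graph, cluster):
    """Strict execution of `cluster`: wait for all external inputs,
    then execute the cluster's instructions sequentially."""
    memo = {}

    def finish(n):  # finish times of instructions outside the cluster
        if n not in memo:
            memo[n] = UNIT + max(
                (finish(p) for p in graph[n] if p not in cluster), default=0)
        return memo[n]

    # Input latency: latest arrival among external producers feeding the cluster.
    inputs = {p for n in cluster for p in graph[n] if p not in cluster}
    input_latency = max((finish(p) for p in inputs), default=0)
    return input_latency + UNIT * len(cluster)


print(fine_grain_latency(GRAPH))                        # → 4
print(coarse_grain_latency(GRAPH, {"c", "d", "e", "f"}))  # → 5
```

In this toy example the coarse grain cluster pays one extra unit of latency for serializing its four instructions, but it waits for only a single input wave; with intra-thread locality (operands kept in registers rather than matched in token storage), that serialization cost is the part the paper argues is largely offset.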