Recent evidence indicates that the exploitation of locality in dataflow programs could have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional non-strict fine grain instruction execution and strict coarse grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. In this paper, we analyze the latency incurred through the partitioning of fine grain instructions into coarse grain clusters, quantifying coarse grain input and output latencies using a set of numeric benchmarks. The results offer compelling evidence that the inner loops of a significant number of numeric codes would benefit from coarse grain execution. Based on cluster execution times, more than 60% of the measured benchmarks favor coarse grain execution. In 64% of the cases the input latency to the cluster is the same in coarse and fine grain execution modes. The results suggest that the effects of increased instruction granularity on latency are minimal for a high percentage of the measured codes, and are in large part offset by available intra-thread locality. Furthermore, simulation results indicate that strict or non-strict data structure access does not change the basic cluster characteristics.
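The trade-off described above can be sketched concretely. The toy model below is an illustration only, not the paper's methodology: the graph, unit instruction times, and the chosen cluster are invented. It compares non-strict fine grain execution (each instruction fires as soon as its operands arrive, so latency is the critical-path length) against strict coarse grain execution of one cluster (the thread starts only after all external inputs are available, then runs sequentially).

```python
# Hypothetical sketch of fine grain vs. coarse grain latency on a small
# dataflow graph. All names and timings are invented for illustration.

# Dataflow graph: instruction -> operand-producing predecessors.
GRAPH = {
    "a": [], "b": [],
    "c": ["a", "b"],
    "d": ["c"],
    "e": ["c"],
    "f": ["d", "e"],
}
UNIT = 1  # assume unit execution time per instruction


def fine_grain_latency(graph):
    """Non-strict execution: latency equals the critical-path length."""
    memo = {}

    def finish(n):
        if n not in memo:
            memo[n] = UNIT + max((finish(p) for p in graph[n]), default=0)
        return memo[n]

    return max(finish(n) for n in graph)


def coarse_grain_latency(graph, cluster):
    """Strict execution of `cluster`: wait for all external inputs,
    then execute the cluster's instructions sequentially."""
    memo = {}

    def finish(n):  # finish times of instructions outside the cluster
        if n not in memo:
            memo[n] = UNIT + max(
                (finish(p) for p in graph[n] if p not in cluster), default=0)
        return memo[n]

    # Input latency: latest arrival among external producers feeding the cluster.
    inputs = {p for n in cluster for p in graph[n] if p not in cluster}
    input_latency = max((finish(p) for p in inputs), default=0)
    return input_latency + UNIT * len(cluster)


print(fine_grain_latency(GRAPH))                        # → 4
print(coarse_grain_latency(GRAPH, {"c", "d", "e", "f"}))  # → 5
```

In this toy example the coarse grain cluster pays one extra unit of latency for serializing its four instructions, but it waits for only a single input wave; with intra-thread locality (operands kept in registers rather than matched in token storage), that serialization cost is the part the paper argues is largely offset.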