A model for dataflow based vector execution

  • Authors:
  • W. Marcus Miller; Walid A. Najjar; A. P. Wim Böhm

  • Affiliations:
  • IBM Corporation, Networking Systems Division, Research Triangle Park, NC; Department of Computer Science, Colorado State University, Fort Collins, CO; Department of Computer Science, Colorado State University, Fort Collins, CO

  • Venue:
  • ICS '94 Proceedings of the 8th international conference on Supercomputing
  • Year:
  • 1994

Abstract

Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research over the past decade has revealed several fundamental problems: synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute “overhead” cycles. Recently, many novel hybrid von Neumann/data-driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand-matching schemes and increased task granularity. Moreover, results from recent studies quantifying locality suggest that sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation.

In this paper we present a data structure for exploiting locality in a data-driven environment: the Vector Cell. A Vector Cell consists of a number of fixed-length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure accesses. We describe the semantics of the model, the processor architecture and instruction set, as well as a Sisal-to-dataflow vectorizing compiler back-end. The model is evaluated by comparing its performance to that of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies, and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time.
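The chunk-level presence-bit scheme described in the abstract can be illustrated with a minimal thread-based sketch. This is an assumption-laden approximation, not the paper's hardware design: the class name `VectorCell`, the `CHUNK_SIZE` value, and the use of a lock plus per-chunk events are all illustrative choices. The point it demonstrates is the access semantics: a read of any element blocks until its entire chunk has been written (intra-chunk strictness), while other chunks can be filled and read independently (inter-chunk non-strictness).

```python
import threading

CHUNK_SIZE = 4  # fixed chunk length; an illustrative value, not from the paper


class VectorCell:
    """Sketch of a Vector Cell: data split into fixed-length chunks,
    each guarded by a presence flag that is set once the chunk is full."""

    def __init__(self, length):
        self.length = length
        n_chunks = (length + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.data = [None] * length
        # One presence "bit" per chunk, modeled here as a threading.Event.
        self.present = [threading.Event() for _ in range(n_chunks)]
        self._filled = [0] * n_chunks  # elements written into each chunk so far
        self._lock = threading.Lock()

    def write(self, i, value):
        c = i // CHUNK_SIZE
        with self._lock:
            self.data[i] = value
            self._filled[c] += 1
            chunk_len = min(CHUNK_SIZE, self.length - c * CHUNK_SIZE)
            if self._filled[c] == chunk_len:
                self.present[c].set()  # chunk complete: release blocked readers

    def read(self, i):
        c = i // CHUNK_SIZE
        # Intra-chunk strictness: block until the whole chunk is present.
        # Other chunks are unaffected (inter-chunk non-strictness).
        self.present[c].wait()
        return self.data[i]
```

For example, after writing elements 0 through 3 of an 8-element cell, `read(2)` returns immediately even though the second chunk is still empty; a read of element 5 would block until elements 4 through 7 were all written.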