Data-Driven Multithreading Using Conventional Microprocessors

Authors:
Costas Kyriacou;Paraskevas Evripidou;Pedro Trancoso
Affiliations:
-;IEEE;IEEE
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2006

Abstract

This paper describes the Data-Driven Multithreading (DDM) model and how it may be implemented using off-the-shelf microprocessors. Data-Driven Multithreading is a nonblocking multithreading execution model that tolerates internode latency by scheduling threads for execution based on data availability. Scheduling based on data availability also makes it possible to exploit cache management policies that significantly reduce cache misses. Such policies include firing a thread for execution only if its data has already been placed in the cache. We call this cache management policy the CacheFlow policy. The core of the DDM implementation presented is a memory-mapped hardware module that is attached directly to the processor's bus. This module, known as the Thread Synchronization Unit (TSU), is responsible for thread scheduling. DDM was evaluated by simulating the Data-Driven Network of Workstations (D²NOW), a DDM implementation built out of regular workstations augmented with the TSU. The simulation was performed for nine scientific applications, seven of which belong to the SPLASH-2 suite. The results show that DDM tolerates both communication and synchronization latency well. Overall, for 16- and 32-node D²NOW machines, the observed speedups were 14.4 and 26.0, respectively.
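
To make the idea of scheduling by data availability concrete, the following is a minimal software sketch, not the paper's TSU design. It assumes each thread carries a count of inputs it is still waiting for, and that a thread becomes ready only when that count reaches zero. The class name `ThreadSyncUnit`, its methods, and the thread names in `demo` are hypothetical and chosen only for illustration; the sketch models neither the hardware interface nor the CacheFlow policy.

```python
from collections import deque


class ThreadSyncUnit:
    """Toy stand-in for a scheduler that fires threads on data availability.

    All names here are hypothetical; this is not the TSU from the paper.
    """

    def __init__(self):
        self.pending = {}     # thread name -> number of inputs still missing
        self.consumers = {}   # thread name -> threads waiting on its result
        self.body = {}        # thread name -> callable that runs to completion
        self.ready = deque()  # threads whose inputs are all available

    def add_thread(self, name, body, producers=()):
        """Register a thread and the producer threads it depends on."""
        self.body[name] = body
        self.pending[name] = len(producers)
        self.consumers.setdefault(name, [])
        for producer in producers:
            self.consumers.setdefault(producer, []).append(name)
        if not producers:
            self.ready.append(name)

    def run(self):
        """Execute threads in data-driven order and return that order."""
        order = []
        while self.ready:
            name = self.ready.popleft()
            self.body[name]()  # nonblocking: the thread never waits mid-run
            order.append(name)
            # Completing a thread makes one more input available to each consumer.
            for consumer in self.consumers[name]:
                self.pending[consumer] -= 1
                if self.pending[consumer] == 0:
                    self.ready.append(consumer)
        return order


def demo():
    """Build a three-thread graph: two producers feed one consumer."""
    values = {}

    def load_a():
        values["a"] = 2

    def load_b():
        values["b"] = 3

    def add():
        values["sum"] = values["a"] + values["b"]

    tsu = ThreadSyncUnit()
    # The consumer is registered first to show that order of registration
    # does not matter; only data availability decides when it runs.
    tsu.add_thread("add", add, producers=("load_a", "load_b"))
    tsu.add_thread("load_a", load_a)
    tsu.add_thread("load_b", load_b)
    return tsu.run(), values


if __name__ == "__main__":
    print(demo())
```

In this sketch the `add` thread is never started until both of its inputs exist, so it cannot stall waiting for data once it is running. A cache-aware variant along the lines of CacheFlow would additionally require that a thread's data be resident in the cache before moving it to the ready queue; that step is omitted here.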