Task Superscalar: An Out-of-Order Task Pipeline

Authors:
Yoav Etsion;Felipe Cabarcas;Alejandro Rico;Alex Ramirez;Rosa M. Badia;Eduard Ayguade;Jesus Labarta;Mateo Valero
Affiliations:
-;-;-;-;-;-;-;-
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Citing 24
Cited 8

HPS, a new microarchitecture: rationale and introduction

MICRO 18 Proceedings of the 18th annual workshop on Microprogramming
Executing a Program on the MIT Tagged-Token Dataflow Architecture

IEEE Transactions on Computers
Multithreading: a revisionist view of dataflow architectures

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Cilk: an efficient multithreaded runtime system

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Jade: A High-Level, Machine-Independent Language for Parallel Programming

Computer
The Stanford Hydra CMP

IEEE Micro
A preliminary architecture for a basic data-flow processor

ISCA '75 Proceedings of the 2nd annual symposium on Computer architecture
Two Fundamental Limits on Dataflow Multiprocessing

PACT '93 Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
WaveScalar

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
CellSs: a programming model for the cell BE architecture

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Carbon: architectural support for fine-grained parallelism on chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
A Practical Data Flow Computer

Computer
Available task-level parallelism on the Cell BE

Scientific Programming - High Performance Computing with the Cell Broadband Engine
Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Flexible architectural support for fine-grain scheduling

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Ease of use with concurrent collections (CnC)

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
OoOJava: an out-of-order approach to parallel programming

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Task superscalar: using processors as functional units

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism

OoOJava: software out-of-order execution

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Bahurupi: A polymorphic heterogeneous multi-core architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Implementation of a hierarchical N-body simulator using the Ompss programming model

Proceedings of the first workshop on Irregular applications: architectures and algorithm
DOJ: dynamically parallelizing object-oriented programs

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Dataflow execution of sequential imperative programs on multicore architectures

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Hardware support for fine-grained event-driven computation in Anton 2

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
MP-Tomasulo: A Dependency-Aware Automatic Parallel Execution Engine for Sequential Programs

ACM Transactions on Architecture and Code Optimization (TACO)
Colored Petri Net model with automatic parallelization on real-time multicore architectures

Journal of Systems Architecture: the EUROMICRO Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present \emph{Task Super scalar}, an abstraction of instruction-level out-of-order pipeline that operates at the task-level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task super scalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task super scalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out-of-order. Furthermore, we propose a design for a distributed task super scalar pipeline front end, that can be embedded into any many core fabric, and manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60ns per task and dynamically uncover data dependencies among as many as \tilde 50,000 in-flight tasks, using 7MB of on-chip eDRAM storage. This configuration achieves speedups of 95–255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task super scalar thus enables programmers to exploit many core systems effectively, while simultaneously simplifying their programming model.