HPS, a new microarchitecture: rationale and introduction
MICRO 18 Proceedings of the 18th annual workshop on Microprogramming
Executing a Program on the MIT Tagged-Token Dataflow Architecture
IEEE Transactions on Computers
Multithreading: a revisionist view of dataflow architectures
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
IEEE Micro
A preliminary architecture for a basic data-flow processor
ISCA '75 Proceedings of the 2nd annual symposium on Computer architecture
Two Fundamental Limits on Dataflow Multiprocessing
PACT '93 Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
StreamIt: A Language for Streaming Applications
CC '02 Proceedings of the 11th International Conference on Compiler Construction
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
CellSs: a programming model for the cell BE architecture
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
A Practical Data Flow Computer
Computer
Available task-level parallelism on the Cell BE
Scientific Programming - High Performance Computing with the Cell Broadband Engine
Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Flexible architectural support for fine-grain scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Ease of use with concurrent collections (CnC)
HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
OoOJava: an out-of-order approach to parallel programming
HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Task superscalar: using processors as functional units
HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
OoOJava: software out-of-order execution
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Bahurupi: A polymorphic heterogeneous multi-core architecture
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Implementation of a hierarchical N-body simulator using the Ompss programming model
Proceedings of the first workshop on Irregular applications: architectures and algorithm
DOJ: dynamically parallelizing object-oriented programs
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Dataflow execution of sequential imperative programs on multicore architectures
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Hardware support for fine-grained event-driven computation in Anton 2
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
MP-Tomasulo: A Dependency-Aware Automatic Parallel Execution Engine for Sequential Programs
ACM Transactions on Architecture and Code Optimization (TACO)
Colored Petri Net model with automatic parallelization on real-time multicore architectures
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
We present \emph{Task Super scalar}, an abstraction of instruction-level out-of-order pipeline that operates at the task-level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task super scalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task super scalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out-of-order. Furthermore, we propose a design for a distributed task super scalar pipeline front end, that can be embedded into any many core fabric, and manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60ns per task and dynamically uncover data dependencies among as many as \tilde 50,000 in-flight tasks, using 7MB of on-chip eDRAM storage. This configuration achieves speedups of 95–255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task super scalar thus enables programmers to exploit many core systems effectively, while simultaneously simplifying their programming model.