Scheduling Tasks with AND/OR Precedence Constraints
SIAM Journal on Computing
IEEE Transactions on Parallel and Distributed Systems
Introduction to algorithms
Practical Pram Programming
OpenMP: An Industry-Standard API for Shared-Memory Programming
IEEE Computational Science & Engineering
IEEE Transactions on Parallel and Distributed Systems
Polynomial complete scheduling problems
SOSP '73 Proceedings of the fourth ACM symposium on Operating system principles
Computer Architecture, Fourth Edition: A Quantitative Approach
Computer Architecture, Fourth Edition: A Quantitative Approach
Task Scheduling for Parallel Systems (Wiley Series on Parallel and Distributed Computing)
Task Scheduling for Parallel Systems (Wiley Series on Parallel and Distributed Computing)
Fpga-based prototype of a pram-on-chip processor
Proceedings of the 5th conference on Computing frontiers
Scheduling Algorithms
An efficient algorithm for exploiting multiple arithmetic units
IBM Journal of Research and Development
Hi-index | 0.00 |
We consider many-core processors with a task-graph oriented programming model, whereby scheduling constraints among tasks are decided offline, and are then enforced by the runtime system using dedicated hardware. Here, exposing and beneficially exploiting fine grain data and control parallelism is increasingly important. Therefore, high expressive power for stating such constraints/directives, along with the ability to implement them in fast, simple hardware, is critical for success. In this paper, we focus on the relationship among different duplicable (multi-instance) tasks, which are used to express and exploit data parallelism. We extend the conventional Start-After-Complete (precedence) constraint to also be usable between replicas of different such tasks rather than only between entire tasks, thereby increasing the exposable parallelism. Additionally, we propose the parameterized Start-After-Start constraint, which can be used to control the degree of ''lockstep'' among multiple such tasks, e.g., in order to improve cache performance when the tasks work on the same data. Also, we briefly describe several additional interesting directives. Finally, we show that the directives can be supported efficiently in hardware. Hypercore, a very efficient CREW PRAM-like shared-cache architecture, which is very challenging because it has extremely fast dispatching for basic constraints, is used in the discussion. However, the new directives have broader applicability. Having shown the possibility of simple implementation and indications of benefit, this motivates further exploration of these directives and their implementation in hardware, as well as their support by programming tools.