Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in advance, however, if there is uncertainty as to when each task's input data will be available. Ideally, each task would run in direct response to the arrival of its input data, thus allowing the computation to proceed in a fine-grained event-driven manner. Realizing this ideal is difficult in practice, and typically requires sacrificing flexibility for performance. In Anton 2, a massively parallel special-purpose supercomputer for molecular dynamics simulations, we addressed this challenge by including a hardware block, called the dispatch unit, that provides flexible and efficient support for fine-grained event-driven computation. Its novel features include a many-to-many mapping from input data to a set of synchronization counters, and the ability to prioritize tasks based on their type. To solve the additional problem of using a fixed set of synchronization counters to track input data for a potentially large number of tasks, we created a software library that allows programmers to treat Anton 2 as an idealized machine with infinitely many synchronization counters. The dispatch unit, together with this library, made it possible to simplify our molecular dynamics software by expressing it as a collection of independent tasks, and the resulting fine-grained execution schedule improved overall performance by up to 16% relative to a coarse-grained schedule for precisely the same computation.
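The general idea of counter-based, event-driven dispatch can be illustrated with a small software model. The sketch below is not the Anton 2 dispatch unit or its library interface: the class `ToyDispatcher`, its methods, and the task and input names are all hypothetical, chosen only to show how one arriving input can decrement several counters (a many-to-many mapping) and how ready tasks can be ordered by a per-type priority. It assumes that a lower priority number runs first and that each task becomes ready once all of its listed inputs have arrived.

```python
import heapq
from collections import defaultdict


class ToyDispatcher:
    """Toy model of counter-based, event-driven dispatch (hypothetical, not Anton 2's design)."""

    def __init__(self):
        self.remaining = {}                # task name -> inputs still outstanding (its counter)
        self.priority = {}                 # task name -> priority of its type (lower runs first)
        self.watchers = defaultdict(list)  # input id -> tasks whose counters it decrements
        self.ready = []                    # heap of (priority, sequence number, task name)
        self.seq = 0                       # tie-breaker: first ready, first run

    def add_task(self, name, inputs, priority):
        """Register a task that waits on the given inputs."""
        inputs = list(inputs)
        self.remaining[name] = len(inputs)
        self.priority[name] = priority
        for item in inputs:
            self.watchers[item].append(name)
        if not inputs:
            self._make_ready(name)

    def _make_ready(self, name):
        heapq.heappush(self.ready, (self.priority[name], self.seq, name))
        self.seq += 1

    def arrive(self, item):
        """Record the arrival of one input; it may decrement several counters."""
        for name in self.watchers.pop(item, []):
            self.remaining[name] -= 1
            if self.remaining[name] == 0:
                self._make_ready(name)

    def run_ready(self):
        """Drain the ready queue in priority order and return the order of execution."""
        order = []
        while self.ready:
            _, _, name = heapq.heappop(self.ready)
            order.append(name)
        return order


# Usage: "pos_a" feeds two tasks, so a single arrival updates two counters.
d = ToyDispatcher()
d.add_task("force", ["pos_a", "pos_b"], priority=1)
d.add_task("bond", ["pos_a"], priority=0)
d.arrive("pos_a")
first = d.run_ready()   # only "bond" has all of its inputs
d.arrive("pos_b")
second = d.run_ready()  # "force" is now ready
```

In this model each task runs as soon as its last input arrives, with no schedule fixed in advance. A real system with a fixed number of hardware counters would also need a way to reuse them across tasks, which this sketch does not attempt to model.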