Hardware support for fine-grained event-driven computation in Anton 2

Authors:
J. P. Grossman; Jeffrey S. Kuskin; Joseph A. Bank; Michael Theobald; Ron O. Dror; Douglas J. Ierardi; Richard H. Larson; U. Ben Schafer; Brian Towles; Cliff Young; David E. Shaw
Affiliations:
D. E. Shaw Research, New York, NY, USA (all authors); Center for Computational Biology and Bioinformatics, Columbia University, New York, NY, USA (David E. Shaw, additionally)
Venue:
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Year:
2013

Abstract

Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in advance, however, if there is uncertainty as to when each task's input data will be available. Ideally, each task would run in direct response to the arrival of its input data, thus allowing the computation to proceed in a fine-grained event-driven manner. Realizing this ideal is difficult in practice, and typically requires sacrificing flexibility for performance. In Anton 2, a massively parallel special-purpose supercomputer for molecular dynamics simulations, we addressed this challenge by including a hardware block, called the dispatch unit, that provides flexible and efficient support for fine-grained event-driven computation. Its novel features include a many-to-many mapping from input data to a set of synchronization counters, and the ability to prioritize tasks based on their type. To solve the additional problem of using a fixed set of synchronization counters to track input data for a potentially large number of tasks, we created a software library that allows programmers to treat Anton 2 as an idealized machine with infinitely many synchronization counters. The dispatch unit, together with this library, made it possible to simplify our molecular dynamics software by expressing it as a collection of independent tasks, and the resulting fine-grained execution schedule improved overall performance by up to 16% relative to a coarse-grained schedule for precisely the same computation.
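
To make the counter-based idea concrete, the following is a minimal software sketch of event-driven dispatch, not a description of the Anton 2 hardware or its library. It assumes a toy model in which each task owns one counter initialized to its number of inputs, each arriving datum may decrement several counters (and each counter may wait on several data, giving a many-to-many mapping), and ready tasks are served in order of a per-type priority. The class `ToyDispatcher`, its methods, and the task and data names are all hypothetical, and the sketch does not model the virtualization of a fixed counter set.

```python
import heapq
from collections import defaultdict


class ToyDispatcher:
    """Toy software model of counter-based, event-driven task dispatch.

    Hypothetical illustration only; it does not reflect the real hardware interface.
    Assumes every task has at least one input and each data tag arrives exactly once.
    """

    def __init__(self):
        self.remaining = {}                # counter id -> arrivals still expected
        self.task_of = {}                  # counter id -> (priority, task name)
        self.watchers = defaultdict(list)  # data tag -> counter ids it decrements
        self.ready = []                    # min-heap of (priority, seq, task name)
        self.seq = 0                       # tie-breaker: FIFO order within a priority

    def add_task(self, name, priority, inputs):
        """Register a task that becomes ready once every tag in `inputs` has arrived."""
        inputs = set(inputs)
        counter = len(self.remaining)
        self.remaining[counter] = len(inputs)
        self.task_of[counter] = (priority, name)
        for tag in inputs:
            self.watchers[tag].append(counter)

    def deliver(self, tag):
        """Model the arrival of one datum: decrement every counter mapped to it."""
        for counter in self.watchers[tag]:
            self.remaining[counter] -= 1
            if self.remaining[counter] == 0:
                priority, name = self.task_of[counter]
                heapq.heappush(self.ready, (priority, self.seq, name))
                self.seq += 1

    def next_task(self):
        """Return the highest-priority ready task (lower number first), or None."""
        return heapq.heappop(self.ready)[2] if self.ready else None


d = ToyDispatcher()
d.add_task("bonded_forces", priority=1, inputs={"pos_A", "pos_B"})
d.add_task("urgent_send", priority=0, inputs={"pos_B"})

for tag in ["pos_A", "pos_B"]:  # data arrive in an order not known in advance
    d.deliver(tag)

order = []
while (task := d.next_task()) is not None:
    order.append(task)
print(order)  # ['urgent_send', 'bonded_forces']
```

In this sketch the arrival of `pos_B` completes both counters at once, and the priority queue then runs the higher-priority task type first, which is the kind of scheduling decision that a fixed, precomputed schedule cannot make when arrival times are uncertain.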