Active messages: a mechanism for integrated communication and computation
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
High speed switch scheduling for local area networks
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Parallel hierarchical N-body methods
Anatomy of a message in the Alewife multiprocessor
ICS '93 Proceedings of the 7th international conference on Supercomputing
Software overhead in messaging layers: where does the time go?
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Remote queues: exposing message queues for optimization and atomicity
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Scheduling of unstructured communication on the Intel iPSC/860
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Limits on Interconnection Network Performance
IEEE Transactions on Parallel and Distributed Systems
How to Get Good Performance from the CM-5 Data Network
Proceedings of the 8th International Symposium on Parallel Processing
Many-to-many personalized communication with bounded traffic
FRONTIERS '95 Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers'95)
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
SPLASH: Stanford parallel applications for shared-memory*
Brewer & Kuszmaul (1994) demonstrated how barriers and traffic interleaving can alleviate the problem of bulk-transfer performance degradation on the Thinking Machines CM-5 massively parallel processor (MPP) by exploiting the observation that one-on-one communication avoids network congestion. We apply and extend these techniques on the Intel Paragon and MIT Alewife machines. Because these machines lack the CM-5's fast hardware support for barriers, we introduce a token-passing scheme that avoids barriers while maintaining one-on-one communication. We also introduce a new algorithm, distributed dynamic scheduling, that brings Brewer & Kuszmaul's observations to bear on irregular traffic patterns by massaging traffic into a sequence of near-permutations at runtime, without requiring any preprocessing or global state. The measured performance of our algorithm exceeds that of traffic interleaving (the most effective technique proposed by Brewer & Kuszmaul) on all three platforms, and is comparable to the performance of static scheduling, which requires preprocessing and global state.
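As an illustration of what "one-on-one communication" means for a regular all-to-all exchange, the following minimal sketch builds a cyclic-shift schedule in which every round is a permutation, so each node sends to exactly one partner and receives from exactly one partner. It is not the token-passing scheme or the distributed dynamic scheduling algorithm summarized above, whose details the abstract does not give. The function names `cyclic_shift_rounds` and `is_permutation` are hypothetical and chosen only for this example.

```python
def cyclic_shift_rounds(num_nodes):
    """Yield one list of (sender, receiver) pairs per round.

    In round r, node i sends to node (i + r) % num_nodes, so every
    round is a permutation of the nodes. This is a static schedule
    for regular all-to-all traffic, used here only as an illustration.
    """
    for r in range(1, num_nodes):
        yield [(i, (i + r) % num_nodes) for i in range(num_nodes)]


def is_permutation(pairs, num_nodes):
    """Return True if each node sends exactly once and receives exactly once."""
    senders = sorted(s for s, _ in pairs)
    receivers = sorted(d for _, d in pairs)
    return senders == list(range(num_nodes)) == receivers


# Example: a 4-node machine needs 3 rounds to cover all sender/receiver pairs.
rounds = list(cyclic_shift_rounds(4))
for r, pairs in enumerate(rounds, start=1):
    print(f"round {r}: {pairs} one-on-one={is_permutation(pairs, 4)}")
```

A fixed schedule like this only keeps the rounds separate if something holds nodes in the same round, for example a barrier between rounds. Irregular traffic does not fit a fixed shift pattern at all, which is the case the abstract's runtime near-permutation approach targets.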