The performance of user-level messaging for PIM-to-PIM (Processing-In-Memory) communication is modeled and analyzed for the DIVA (Data IntensiVe Architecture) system. Six benchmarks are used, two from each of the three categories described for the Pallas MPI Benchmarks (PMB): single message transfer, parallel transfer, and collective communication. The benchmarks are PingPong, PingPing, SendReceive, Exchange, Barrier synchronization, and AllToAll personalized exchange. The main significance of this work lies in evaluating an implementation of system-wide support for memory-to-memory and memory-to-host communication via a parcel buffer, which serves as the network interface. A second contribution is an optimal algorithm for Barrier synchronization and an optimal algorithm, with full channel utilization, for AllToAll personalized exchange, both for a bi-directional ring of up to 8 DIVA PIMs in the memory system of a Hewlett-Packard zx6000 server. The algorithms can be scaled to a larger number of PIM chips with only a small degradation in performance relative to the optimal solution. Our analysis shows that the currently employed communication mechanism can be used very efficiently for collective communication operations, and it also exposes the bottlenecks in the current design that future improvements should address.
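As a rough illustration of what "full channel utilization" can mean for AllToAll personalized exchange on a bi-directional ring, the sketch below counts how many messages cross each directed link when every node sends one distinct message to every other node. It is not the algorithm from the paper, which the abstract does not describe. It assumes unit-size messages, shortest-path routing, and a tie-break for the diametrically opposite node in which even-numbered sources route clockwise and odd-numbered sources route counterclockwise. The function names `ring_link_loads` and `load_summary` are hypothetical.

```python
from collections import Counter


def ring_link_loads(n):
    """Count messages crossing each directed link of an n-node ring (n >= 3).

    Every node sends one message to every other node along a shortest path.
    Links are keyed as (from_node, to_node).
    """
    loads = Counter()
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            cw = (dst - src) % n   # hops if routed clockwise
            ccw = n - cw           # hops if routed counterclockwise
            # Assumed tie-break: even sources go clockwise, odd go counterclockwise.
            go_cw = cw < ccw or (cw == ccw and src % 2 == 0)
            node = src
            for _ in range(cw if go_cw else ccw):
                nxt = (node + 1) % n if go_cw else (node - 1) % n
                loads[(node, nxt)] += 1
                node = nxt
    return loads


def load_summary(n):
    """Return (busiest-link load, average load over all 2*n directed links)."""
    loads = ring_link_loads(n)
    total_hops = sum(loads.values())
    return max(loads.values()), total_hops / (2 * n)


if __name__ == "__main__":
    busiest, average = load_summary(8)
    print(f"busiest link: {busiest} messages, average per link: {average}")
```

Under these assumptions, for 8 nodes the busiest link carries 8 messages and the average over the 16 directed links is also 8, so no link sits idle relative to the others. If each link moved one message per step, 8 steps would be a lower bound on completion time. Whether the paper's schedule reaches this bound in the same way is not stated in the abstract.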