A number of models have been proposed in recent years for message-passing parallel systems. Examples are the postal model and its generalization, the LogP model. In the postal model, a parameter λ is used to model the communication latency of the message-passing system. During each round, each node can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of λ and will arrive at the receiving node at round r + λ − 1.

Our goal in this paper is to bridge the gap between theoretical modeling and practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely the broadcast operation and the global combine operation. Those practical issues include, for example, 1) techniques for measuring the value of λ on a given machine, 2) creating efficient broadcast algorithms that take the latency λ and the number of nodes n as parameters, and 3) creating efficient global combine algorithms for parallel machines in which λ is not an integer. We propose solutions that address those practical issues and present results of an experimental study of the new algorithms on the Intel Delta machine. Our main conclusion is that the postal model can help in performance prediction and tuning; for example, a properly tuned broadcast improves on the known implementation by more than 20%.
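To make the role of λ and n concrete, the following is a minimal sketch, not the algorithm developed in the paper. It assumes an integer λ ≥ 1 and counts elapsed time units rather than the round numbers used above. It relies on the recurrence commonly used for the postal model: a node is informed at time t either because it was already informed at time t − 1, or because a node that was informed at time t − λ sent it a message that has just arrived. The function names `informed_nodes` and `broadcast_time` are hypothetical and chosen only for illustration.

```python
def informed_nodes(t: int, lam: int) -> int:
    """Upper bound on nodes holding the message after t time units.

    Assumes the postal model with integer latency lam >= 1:
    N(t) = 1 for 0 <= t < lam, and N(t) = N(t-1) + N(t-lam) otherwise.
    """
    if t < 0 or lam < 1:
        raise ValueError("need t >= 0 and integer lam >= 1")
    counts = [1] * lam  # only the source knows the message before time lam
    for s in range(lam, t + 1):
        counts.append(counts[s - 1] + counts[s - lam])
    return counts[t]


def broadcast_time(n: int, lam: int) -> int:
    """Smallest t such that informed_nodes(t, lam) >= n."""
    if n < 1 or lam < 1:
        raise ValueError("need n >= 1 and integer lam >= 1")
    counts = [1] * lam
    t = 0
    while counts[t] < n:
        t += 1
        if t >= len(counts):
            counts.append(counts[t - 1] + counts[t - lam])
    return t


if __name__ == "__main__":
    # Compare how the lower bound on broadcast time grows with latency.
    for lam in (1, 2, 4):
        print(lam, broadcast_time(64, lam))
```

With λ = 1 the recurrence doubles the informed set every time unit, which recovers the familiar binomial-tree broadcast. For larger λ the growth is slower, which is why a broadcast tree tuned to the measured latency can differ from the binomial tree. Non-integer λ, which the paper addresses for global combine, is outside the scope of this sketch.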